A schematic diagram of the proposed research framework is presented in Fig. 1. First, outliers in the original water demand time series are identified using the 3σ criterion, and the identified outliers are smoothed with the weighted average method. Then, according to the seasonal characteristics of the series, the Seasonal and Trend decomposition using Loess (STL) method (Antunes et al. 2018; Chen et al. 2022) is adopted to extract the seasonal, trend and residual features of the smoothed series. The time series is thus decomposed into three components, the trend series Tt, the seasonal series St and the residual series Rt, as shown in Fig. 1. Considering the strong forecasting capability of LSTM and the fact that AdaBoost combines base learners with independent error distributions, a forecasting method that couples the two, named AdaBoost-LSTM, is developed to mitigate the overfitting of LSTM and further improve forecast accuracy. The three decomposed series serve as the inputs of three separate AdaBoost-LSTM models. The first model, displayed on the left of Fig. 1, forecasts the trend of the original data; the middle one extracts and predicts the seasonal features over time; the right one is designed to strengthen the model's ability to capture peaks in the forecast. Finally, the outputs of the three AdaBoost-LSTM deep learning models are summed to obtain the final water demand forecast. The overall model is named the STL-Ada-LSTM model.
Identification and Processing of Outliers
Outliers in a time series, depending on their nature, may have a moderate to significant impact on model forecasts (Chen and Liu 1993). To guarantee data reliability, the 3σ criterion is used to identify outliers in the original water demand series Xt. Under the 3σ criterion, Xt is constrained to a 99.73% confidence interval (Du et al. 2021), and any outliers are smoothed to meet this standard with the weighted average method of Eq. (1):
\({E_t}={\theta _{t - k}}{x_{t - k}}+ \cdots +{\theta _{t - 1}}{x_{t - 1}}+{\theta _{t+1}}{x_{t+1}}+ \cdots +{\theta _{t+k}}{x_{t+k}}\) (1)

where \({\theta _{t - k}}\) and \({x_{t - k}}\) represent the weights and the historical data near the outlier, respectively; k is a positive integer and \({E_t}\) is the smoothed value of the outlier. After processing, all data fall within the band \(\left[ {{\mu _t} - 3{\sigma _t},{\mu _t}+3{\sigma _t}} \right]\), where \(\mu\) and \(\sigma\) represent the mean and standard deviation of the original water demand series, respectively (Alvarado-Barrios et al. 2020).
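To make this preprocessing step concrete, the following Python sketch detects outliers with the 3σ criterion and replaces them according to Eq. (1). Since the weights θ are not specified here, uniform weights over the 2k neighbours are assumed purely for illustration.

```python
import numpy as np

def smooth_outliers_3sigma(x, k=2):
    """Replace points outside [mu - 3*sigma, mu + 3*sigma] with a
    weighted average of the 2k neighbouring observations (Eq. 1).
    Uniform weights are an illustrative assumption."""
    x = np.asarray(x, dtype=float).copy()
    mu, sigma = x.mean(), x.std()
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    outlier_idx = np.where((x < lower) | (x > upper))[0]
    for t in outlier_idx:
        # neighbours on both sides of the outlier, excluding t itself
        idx = [i for i in range(t - k, t + k + 1)
               if 0 <= i < len(x) and i != t]
        weights = np.ones(len(idx)) / len(idx)   # theta_{t-k}, ..., theta_{t+k}
        x[t] = np.dot(weights, x[idx])           # E_t, Eq. (1)
    return x
```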
Seasonal and Trend Decomposition Using Loess
Seasonal and Trend decomposition using Loess (STL) is a time series decomposition method based on locally weighted scatterplot smoothing (loess) (Cleveland et al. 1990). The time series is decomposed into three additive components, seasonal St, trend Tt and remainder Rt: \({X_t}={S_t}+{T_t}+{R_t}\). Compared with traditional seasonal decomposition techniques, such as X-12-ARIMA and the ratio-to-moving-average method, STL provides more robust results (Xiong et al. 2018). Because short-term water demand time series are characterized by seasonality and instability (Antunes et al. 2018), STL is well suited to them. Moreover, the method is not as complicated as the discrete wavelet transform, which requires careful tuning. STL is an iterative method consisting of two recursive procedures, an inner loop and an outer loop; the detailed steps are described by Xiong et al. (2018).
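As an illustration, STL is readily available in the statsmodels Python library. The sketch below assumes a daily demand series with a weekly (period = 7) cycle and a hypothetical file name; neither is prescribed by this study.

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Hypothetical input: a daily water demand series with a weekly cycle.
demand = pd.read_csv("water_demand.csv", index_col=0,
                     parse_dates=True).squeeze()

stl = STL(demand, period=7, robust=True)  # robust=True enables the outer loop
result = stl.fit()
trend, seasonal, residual = result.trend, result.seasonal, result.resid
# Additive reconstruction: demand == trend + seasonal + residual
```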
Deep Learning Using Long Short-Term Memory
Long Short-Term Memory (LSTM) is controlled by three gates, i.e., an input gate, an output gate and a forget gate, which form a self-looping update. The forget gate (f) decides how much information is kept and passed to the next stage, the input gate (i) decides how much new information is added, and the output gate (o) updates the system state using the information and cell state (c) produced by the previous two gates. The LSTM update process is briefly illustrated as follows (Han et al. 2019):
$${f_t}=\sigma \left( {{W_f} \cdot \left[ {{h_{t - 1}},{x_t}} \right]+{b_f}} \right)$$ (2)

$${i_t}=\sigma \left( {{W_i} \cdot \left[ {{h_{t - 1}},{x_t}} \right]+{b_i}} \right)$$ (3)

$${o_t}=\sigma \left( {{W_O} \cdot \left[ {{h_{t - 1}},{x_t}} \right]+{b_O}} \right)$$ (4)

$${\tilde {c}_t}=\tanh \left( {{W_c} \cdot \left[ {{h_{t - 1}},{x_t}} \right]+{b_c}} \right),\quad {c_t}={f_t} \times {c_{t - 1}}+{i_t} \times {\tilde {c}_t}$$ (5)

$${h_t}={o_t} \times \tanh \left( {{c_t}} \right)$$ (6)
where \({W_f}\), \({W_i}\), \({W_c}\), \({W_O}\) denote the weight matrices of the forget gate, input gate, cell state and output gate, respectively; \({b_f}\), \({b_i}\), \({b_c}\), \({b_O}\) are the corresponding bias terms; h represents the hidden state, x the input and t the time step; σ denotes the sigmoid function and "×" denotes point-wise multiplication.
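For clarity, a minimal NumPy sketch of one forward step of an LSTM cell, implementing Eqs. (2)-(6) directly, is given below; the dictionary-based parameter layout is an illustrative choice rather than part of the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell, following Eqs. (2)-(6).
    W and b are dicts of gate weight matrices and bias vectors;
    each W[g] has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate,   Eq. (2)
    i = sigmoid(W["i"] @ z + b["i"])         # input gate,    Eq. (3)
    o = sigmoid(W["o"] @ z + b["o"])         # output gate,   Eq. (4)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c = f * c_prev + i * c_tilde             # cell update,   Eq. (5)
    h = o * np.tanh(c)                       # hidden state,  Eq. (6)
    return h, c
```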
The Back Propagation Through Time (BPTT) algorithm is used to train the LSTM model (Tepper et al. 2016). First, the outputs of each memory cell in the LSTM are computed in a forward pass, and the error of each cell is then computed in a backward pass. The gradients of each weight matrix and bias are derived from these errors and fed into an optimization algorithm, such as Stochastic Gradient Descent (SGD) or Adam (Tepper et al. 2016). In this study, the Adam optimization algorithm is applied to train every LSTM of the ensemble model.
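In practice, deep learning frameworks apply BPTT automatically when fitting recurrent layers. A minimal sketch of one LSTM base learner trained with Adam and the mean-squared-error loss is shown below using tensorflow.keras; the window length, layer size, learning rate and epoch count are illustrative assumptions, not the tuned values of this study.

```python
from tensorflow import keras

def build_lstm(window=24, n_units=32):
    """One LSTM weak learner: a single recurrent layer plus a
    linear output; sizes are illustrative assumptions."""
    model = keras.Sequential([
        keras.layers.Input(shape=(window, 1)),
        keras.layers.LSTM(n_units),
        keras.layers.Dense(1),
    ])
    # Keras performs BPTT internally when training recurrent layers.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
    return model

# model = build_lstm()
# model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
```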
Adaptive Boosting
As an efficient ensemble learning model, Adaptive Boosting (AdaBoost) performs well in time series forecasting (Bai et al. 2021; Xiao et al. 2019) by combining a set of weak classifiers into a more powerful learner. The weak classifiers are trained sequentially, each on a reweighted version of the sample; the weights are determined by the errors of the previous round, and boosting thereby reduces both bias and variance in supervised learning. The best weak classifiers selected in each training iteration are combined into a strong classifier (Xiao et al. 2019). The AdaBoost algorithm is constructed as follows (Bai et al. 2021):
Step 1: Input the training dataset \(\left( {X,Y} \right)=\left\{ {\left( {{x_1},{y_1}} \right),\left( {{x_2},{y_2}} \right), \cdots ,\left( {{x_N},{y_N}} \right)} \right\}\), where X is the historical water demand data, Y is the current observed data and N is the length of the data; each \({x_i}\) is a column vector with d entries, \({x_i} \in \chi \subseteq {R^d}\).
Step 2: Initialize the weights, \({D_1}=\left( {{w_{11}},{w_{12}}, \cdots ,{w_{1N}}} \right)\)
\({w_{1i}}=1/N,i=1,2,...,N\) (7)
Step 3: Repeat the following for m = 1, 2, …, M to obtain M base learners; in this study, the base learners are LSTM models:
(1) The mth base learner \({G_m}\left( x \right):\chi \to \left\{ { - 1,+1} \right\}\) is obtained by training on the data weighted according to the sample weight distribution Dm.
(2) Calculate the classification error rate of \({G_m}\left( x \right)\) on the weighted training dataset:
$${e_m}=\sum\limits_{{i=1}}^{N} {P\left( {{G_m}\left( {{x_i}} \right) \ne {y_i}} \right)} =\sum\limits_{{i=1}}^{N} {{w_{mi}}} I\left( {{G_m}\left( {{x_i}} \right) \ne {y_i}} \right)$$ (8)
In the above equation, \(I\left( \cdot \right)\) is the indicator function.
(3) The coefficient of \({G_m}\left( x \right)\) (i.e., the weight of the base learner in the final combination) is calculated as:
$${\alpha _m}=\frac{1}{2}\log \frac{{1 - {e_m}}}{{{e_m}}}$$ (9)
(4) Update weights of the training samples:
$${D_{m+1}}=\left( {{w_{m+1,1}},{w_{m+1,2}},...,{w_{m+1,N}}} \right)$$ (10)

$${w_{m+1,i}}=\frac{{{w_{mi}}}}{{{Z_m}}}\exp \left( { - {\alpha _m}{y_i}{G_m}\left( {{x_i}} \right)} \right),\quad i=1,2,...,N$$ (11)
where \({Z_m}\) is the normalization factor, so that all the elements of \({D_{m+1}}\) sum to one.
$${Z_m}=\sum\limits_{{i=1}}^{N} {{w_{mi}}} \exp \left( { - {\alpha _m}{y_i}{G_m}\left( {{x_i}} \right)} \right)$$ (12)
(5) Construct the final linear combination of the base learners:
$$f\left( x \right)=\sum\limits_{{m=1}}^{M} {{\alpha _m}} {G_m}\left( x \right)$$ (13)
The final regression model is obtained as:
$$G\left( x \right)=\sum\limits_{{m=1}}^{M} {{\alpha _m}{G_m}\left( x \right)}$$ (14)
According to Eq. (9), \({\alpha _m} \geqslant 0\) when the error rate \({e_m} \leqslant 0.5\), and \({\alpha _m}\) increases as \({e_m}\) decreases; that is, the smaller the classification error rate, the larger the weight of the base learner in the final combination. In this way, AdaBoost adapts to the training error rate of each weak classifier.
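The weight-update mechanics of Step 3 fit in a few lines of Python. The sketch below implements Eqs. (8), (9), (11) and (12) for one boosting round, given a boolean mask of misclassified samples; the small clipping constant is a numerical safeguard added here, not part of the algorithm.

```python
import numpy as np

def adaboost_round(wrong, w):
    """One AdaBoost iteration. `wrong` is a boolean mask marking the
    samples misclassified by G_m; `w` holds the current weights D_m."""
    e_m = np.clip(np.sum(w[wrong]), 1e-10, 1 - 1e-10)  # Eq. (8)
    alpha_m = 0.5 * np.log((1.0 - e_m) / e_m)          # Eq. (9)
    # y_i * G_m(x_i) = +1 if correct, -1 if wrong
    agreement = np.where(wrong, -1.0, 1.0)
    w_new = w * np.exp(-alpha_m * agreement)           # Eq. (11), numerator
    w_new /= w_new.sum()                               # divide by Z_m, Eq. (12)
    return alpha_m, w_new

# N = 100
# w = np.full(N, 1.0 / N)   # initial weights D_1, Eq. (7)
```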
Ensemble AdaBoost-LSTM Deep Learning Model
The AdaBoost algorithm was originally designed for classification; consequently, to use it for water demand forecasting, it must be modified appropriately. In this study, we develop the AdaBoost-LSTM (Ada-LSTM) model by adjusting the sample weights according to whether a sample's forecast error exceeds a specified threshold, and LSTM deep learning models are used as the weak learners. Figure 2 shows the architecture of the proposed Ada-LSTM model for SWDF.
First, the decomposed series Tt and Rt are each divided into a training set (80%) and a testing set (20%). Second, the optimal structure of a single LSTM model is selected by analyzing the sensitivities of the main hyper-parameters, including the number of hidden layers, the number of neurons per hidden layer and the number of training epochs. In the experiments, BPTT (Back Propagation Through Time) is used to optimize the training process of the LSTM networks, and the value ranges of the hyper-parameters follow existing studies (Du et al. 2021; Bai et al. 2021). To reduce the number of parameters and ease the tuning procedure, every LSTM base learner shares the same hyper-parameter values, so only four hyper-parameters need to be set within the reference intervals. The loss function of each LSTM model is the mean squared error.

After the first LSTM is trained, the results of the first base learners \(G_{{_{1}}}^{T}\) and \(G_{{_{1}}}^{S}\) are automatically passed to the Trend and Seasonal AdaBoost models, respectively. The sample weights are then updated based on the error rate of the previous round and transferred to the second LSTM; the weights of samples that were not accurately predicted in the last round increase in the next round, so that performance on these samples improves. The forecast results of the second LSTM are likewise passed on within the ensemble. Following this rule, the training sets are processed by the LSTM models one by one to continually improve the forecasts of the hard samples. After multiple iterations, the outputs of the LSTM models are weighted and combined into the final strong learner according to Eqs. (9)-(14).
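A condensed sketch of this Ada-LSTM training loop follows. The relative-error threshold used to mark a sample as "wrong", the number of base learners M and the build_lstm factory (from the earlier Keras sketch) are illustrative assumptions consistent with the threshold-based modification described above, not the exact settings of this study.

```python
import numpy as np

def ada_lstm_fit(X, y, build_lstm, M=5, threshold=0.05, epochs=100):
    """Train M LSTM weak learners with AdaBoost-style reweighting.
    A sample is 'wrong' when its relative error exceeds `threshold`."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # Eq. (7)
    learners, alphas = [], []
    for m in range(M):
        model = build_lstm()
        # sample_weight makes the LSTM focus on poorly predicted points
        model.fit(X, y, sample_weight=w, epochs=epochs, verbose=0)
        pred = model.predict(X, verbose=0).ravel()
        wrong = np.abs(pred - y) / np.abs(y) > threshold
        e_m = np.clip(np.sum(w[wrong]), 1e-10, 1 - 1e-10)   # Eq. (8)
        alpha_m = 0.5 * np.log((1.0 - e_m) / e_m)           # Eq. (9)
        w = w * np.exp(-alpha_m * np.where(wrong, -1.0, 1.0))
        w /= w.sum()                                        # Eqs. (11)-(12)
        learners.append(model)
        alphas.append(alpha_m)
    return learners, np.asarray(alphas)

def ada_lstm_predict(learners, alphas, X):
    """Weighted combination of the weak learners, cf. Eqs. (13)-(14)."""
    preds = np.stack([m.predict(X, verbose=0).ravel() for m in learners])
    return (alphas / alphas.sum()) @ preds
```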
Different from the forecast procedure for the other two decomposed series, the residual series exhibits large fluctuations; therefore, before forecasting, the series is ranked and all peaks, identified through the hypothesis test on the difference of two consecutive slopes introduced by Bramante (2019), are placed in the training set together with the other historical data. Finally, the sum of the outputs of the Trend AdaBoost-LSTM, Seasonal AdaBoost-LSTM and Residual AdaBoost-LSTM models is the final forecast.
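Because the STL decomposition is additive, recombining the three component forecasts requires only a sum; in code, with hypothetical variable names for the three model outputs:

```python
# Additive recombination of the three AdaBoost-LSTM outputs
final_forecast = trend_forecast + seasonal_forecast + residual_forecast
```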