Climate-driven Model Based on Long Short-Term Memory and Bayesian Optimization for Multi-day-ahead Daily Streamflow Forecasting

Many previous studies have developed decomposition and ensemble models to improve runoff forecasting performance. However, these decomposition-based models usually introduce large decomposition errors into the modeling process. Although the variation in runoff time series is greatly driven by climate change, many previous studies considering climate focused only on rainfall-runoff modeling, with few meteorological factors as input. Therefore, a climate-driven streamflow forecasting (CDSF) framework is proposed to improve runoff forecasting accuracy. This framework is realized using principal component analysis (PCA), long short-term memory (LSTM) and Bayesian optimization (BO), and is referred to as PCA-LSTM-BO. To validate its effectiveness and superiority, the PCA-LSTM-BO method was compared with one autoregressive LSTM model and two other CDSF models based on PCA, BO, and either support vector regression (SVR) or gradient boosting regression trees (GBRT), namely, PCA-SVR-BO and PCA-GBRT-BO, respectively. A generalization performance index based on the Nash-Sutcliffe efficiency (NSE), called the GI(NSE) value, is proposed to evaluate the generalizability of the model. The results show that (1) the proposed model is significantly better than the other benchmark models in terms of the mean square error (MSE <= 185.782), NSE (>= 0.819), and GI(NSE) (<= 0.223) for all the forecasting scenarios; (2) the PCA in the CDSF framework can improve the forecasting capacity and generalizability; (3) the CDSF framework is superior to the autoregressive LSTM models for all the forecasting scenarios; and (4) the GI(NSE) value is demonstrated to be effective in selecting the optimal model with better generalizability.


Introduction
Accurate streamflow prediction is vital to water resource allocation and planning (Abro et al. 2020). Therefore, the development of streamflow prediction models has attracted significant attention in recent decades. These models are mainly subdivided into data-driven and physical models (Chua 2012; Shirmohammadi et al. 2012). Physical models have high requirements for input climate data, information about physical properties, boundary conditions, and computational resources; hence, they are rarely used in practical applications. Data-driven models are extensively used because of their lower complexity and modest information demands (Fang et al. 2019; Wu and Huang 2009).
Previous data-driven streamflow forecasting models have focused on time series models: autoregressive integrated moving average models, multivariable linear regression models, moving average models and so on (Myronidis et al. 2018; Sun et al. 2019; Yu et al. 2018). However, these time series models fail to reasonably forecast nonlinear runoff series because of their stationarity assumption (He et al. 2019). Machine learning (ML) models such as long short-term memory (LSTM) networks (Li and Cao 2018), support vector regression (SVR) (Moazenzadeh et al. 2018) and gradient boosting regression trees (GBRTs) (Liao et al. 2020) can address the nonstationarity and nonlinearity of runoff prediction (Farfán et al. 2020; Vapnik et al. 1996). However, pure ML models still perform poorly in predicting complex nonlinear and nonstationary runoff series when they are developed without climate factors as input. In many study cases with available climate data, only rainfall information was considered to predict runoff (Mao et al. 2021). Although rainfall-runoff modeling contributes greatly to improving runoff forecasting performance, more climate information is needed to further improve streamflow forecasting performance (Awotwi et al. 2021). Additionally, signal decomposition algorithms such as variational mode decomposition, singular spectrum analysis and wavelets have been introduced to handle the nonlinearity and nonstationarity of raw runoff series. However, these algorithms introduce large decomposition errors when performing decomposition without using future information (Zhang and Haghani 2015; Zhao et al. 2021).
To address these problems, a climate-driven streamflow forecasting (CDSF) framework is proposed. In the CDSF framework, the climate data dimensionality is first reduced to save computational resources and modeling time and to decrease the overfitting risk. Then, using these climate data as input, a data-driven model is used to predict runoff. The hyperparameters of this data-driven model are finally tuned by an optimization algorithm. Variation analysis, principal component analysis (PCA), cluster analysis and so on can be used to reduce dimensionality (George and Vidyapeetham 2012; Kourtit et al. 2021; Narayan and Ghosh 2021). Among these, PCA is highly efficient and can transform the input variables into uncorrelated variables (Asante-Okyere et al. 2020; Bartoletti et al. 2018). Many data-driven models, such as LSTM, GBRT and SVR, can implement runoff forecasting. However, back propagation neural network (BPNN) models usually suffer from overfitting, local convergence, and low learning speed (Bisoyi et al. 2019). SVR and GBRT, which are shallow learning methods, are highly sensitive to the selection of hyperparameters and have a low capacity to express diverse information (He et al. 2019; Huang et al. 2021). The LSTM model, a deep learning model, can address these drawbacks and learn long-term dependencies from input meteorological data (Bai et al. 2019). Although trial and error, grid search (GS), genetic algorithms (GA), random search (RS), Bayesian optimization (BO) and so on can be used for hyperparameter optimization, BO is especially useful for expensive function evaluations (Dewancker et al. 2016; Manheim and Detwiler 2019; Su et al. 2014; Zuo et al. 2020b), such as tuning the hyperparameters of LSTM in this study. Therefore, we realized this CDSF framework using PCA, LSTM, and BO, namely, PCA-LSTM-BO.
To validate the effectiveness and superiority of the CDSF framework and the PCA-LSTM-BO model, one autoregressive LSTM model and two other CDSF models based on PCA, BO, SVR, and GBRT, namely, PCA-SVR-BO and PCA-GBRT-BO, were compared. The first experiment compares the forecast ability of different CDSF models to show the advantages of LSTM. The second experiment compares the proposed model with autoregressive models to prove the stability and high generalizability of the CDSF framework. The proposed model and the benchmark models are evaluated using daily runoff data and climate data collected from four stations located in the Huangshui River catchment, China.

Study Area and Data Source
The study area is located in eastern Qinghai Province, China; the Huangshui River (Fig. S1) is an important branch of the Yellow River. The Huangshui River is 374 km long and originates from Baohutu Mountain in Haiyan County, Qinghai Province. The river has a continental climate, and because of the great terrain differences in this area, the temporal and spatial variations in temperature are also large. The mean annual streamflow of the Huangshui River is 46.5×10⁸ m³, and that at the Minhe station on the mainstream is 17.9×10⁸ m³. Nearly 60% of Qinghai's population, 52% of its cultivated land and more than 70% of its industrial and mining enterprises are concentrated in the Hehuang Valley. The river is also the main source of urban water in Lanzhou. Therefore, robust runoff prediction plays a vital role in production and life in this area.
In this study, daily runoff and climate data are used to evaluate the proposed framework. The runoff data are from hydrological yearbooks, and the climate data are from the China Meteorological Information Center (http://data.cma.cn, last accessed 18 September 2021). The details of the variables are shown in Table S1; all variables are used as input, and runoff is the output. Each variable spans January 1, 2006, to December 31, 2013 (2922 records). The time series of each variable is subdivided into three groups: training set, validation set and testing set, representing approximately 60%, 20% and 20% of the entire dataset, respectively.

Principal Component Analysis
As a simple and useful technique, PCA reduces the dimensionality of a group of related features while preserving as much of the original information as possible (Abdi and Williams 2010). PCA seeks directions with large variance within the same variable while keeping the covariance among different variables small. The PCA method has five main steps: (1) standardizing the original variables, (2) computing the covariance matrix of the standardized variables, (3) computing the eigenvalues and eigenvectors of the covariance matrix, (4) forming the feature vector from the eigenvalues and eigenvectors, and (5) recasting the original variables into principal components (PCs) (Boubchir and Aourag 2020; Davis and Sampson 1986).
Z_j = a_{j1}X_1 + a_{j2}X_2 + ⋯ + a_{jp}X_p    (1)

where Z_j represents the j-th PC; a_{j1}, …, a_{jp} are the elements of the related eigenvector; X_1, …, X_p represent the input variables; and p represents the number of input variables.
The number of PCs (or predictors) is the only parameter that should be predefined in PCA. The number of predictors for each station is different.
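As an illustration, the dimensionality-reduction step can be sketched with scikit-learn (an assumed implementation; the study does not name its toolkit). The 500 samples, 36 lagged predictors and 18 retained components below are placeholders, not the study's fitted values:

```python
# Minimal sketch of the PCA step on synthetic data: standardize the
# predictors, then project them onto uncorrelated principal components.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 36))                    # hypothetical lagged predictors
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)   # two strongly correlated columns

Z = StandardScaler().fit_transform(X)             # step (1): standardize
pca = PCA(n_components=18)                        # keep half the predictors
pcs = pca.fit_transform(Z)                        # steps (2)-(5): project to PCs

print(pcs.shape)
print(pca.explained_variance_ratio_[:3])
```

The resulting PCs are mutually uncorrelated by construction, which is the property the framework relies on to reduce redundancy in the climate inputs.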

Long Short-term Memory
LSTM was developed to solve the gradient explosion and vanishing-gradient problems of recurrent neural networks in long-sequence training. Similar to recurrent neural networks, LSTM (see Fig. S2) also has a chain structure of repeating neural network modules (Bengio et al. 1994; Zuo et al. 2020b).
LSTM mainly controls the flow of information into the cell state through three "gates": the forget gate (f_t), the input gate (i_t) and the output gate (o_t).
Step 1: Use the sigmoid function of the forget gate to determine what information the cell state should discard. From h_{t−1} and x_t, it outputs a vector f_t with values in (0, 1) that indicates which information in the cell state C_{t−1} is preserved and which is abandoned (Kratzert et al. 2018):

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (2)

where t is each time step from t = 1 to t = n; h_{t−1} is the previous hidden state; and x_t is the current input vector. b_f, U_f and W_f represent the bias vector, recurrent weight matrix and input weight matrix, respectively.
Step 2: Determine what information is used to renew the cell state based on h_{t−1} and x_t (Zuo et al. 2020a). The input gate i_t and a new candidate cell state C̃_t, obtained through the tanh layer, are computed as follows (Kratzert et al. 2018):

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (3)
C̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)    (4)

where tanh(⋅) is the tanh activation function, which maps a real-valued input to [−1, 1] (when the input is 0, the output is also 0), and b_i, U_i, W_i, b_c, U_c and W_c are the corresponding bias vectors and weight matrices.
Step 3: Use formula (5) to update C_{t−1} and obtain the new cell state C_t:

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t    (5)

where ⊙ denotes elementwise multiplication.
Step 4: After updating the cell state, determine which features of the cell state to output based on h_{t−1} and x_t; the judgment condition is obtained through the sigmoid layer of the output gate o_t. The cell state is then passed through a tanh layer and multiplied by o_t to obtain the final output of the unit (Kratzert et al. 2018):

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (6)
h_t = o_t ⊙ tanh(C_t)    (7)

where b_o, U_o and W_o are the bias vector, recurrent weight matrix and input weight matrix, respectively. The range of h_t is [−1, 1].
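The four steps above can be sketched as a single LSTM cell update in NumPy; the input size, hidden size and random weights below are illustrative assumptions, not the trained model:

```python
# One LSTM cell step: forget gate, input gate, candidate state, cell
# update (formula (5)), output gate, and hidden state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])      # forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])      # input gate
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                             # formula (5)
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])      # output gate
    h_t = o_t * np.tanh(C_t)                                       # final output
    return h_t, C_t

n_in, n_hid = 9, 16                      # e.g. nine PCs as input (illustrative)
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "Wf": (n_hid, n_in), "Uf": (n_hid, n_hid), "bf": (n_hid,),
    "Wi": (n_hid, n_in), "Ui": (n_hid, n_hid), "bi": (n_hid,),
    "Wc": (n_hid, n_in), "Uc": (n_hid, n_hid), "bc": (n_hid,),
    "Wo": (n_hid, n_in), "Uo": (n_hid, n_hid), "bo": (n_hid,)}.items()}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                       # run a short input sequence
    h, C = lstm_step(rng.normal(size=n_in), h, C, p)
print(h.shape)                           # hidden state stays bounded in (-1, 1)
```

This also makes the bounded range of h_t visible: since o_t ∈ (0, 1) and tanh(C_t) ∈ (−1, 1), the hidden state never leaves (−1, 1).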

Bayesian Optimization
BO is a method commonly used to tune hyperparameters by optimizing a black-box function. The BO process has two main parts: a prior function (PF) and an acquisition function. If the distribution of the model is known, the optimal model can be selected according to experience; if not, a kernel function based on a Gaussian process (GP) can be used to model the black-box function. Assume that the PF follows a Gaussian distribution; the function values f(x_i) of the PF then also obey a Gaussian distribution (Rasmussen 2004):

f(x_{1:n}) ∼ N(m(x_{1:n}), K)

where m represents the mean function of the GP and K represents the kernel (covariance) matrix.
The acquisition function is also called the utility function. Three functions are commonly used: expected improvement (EI), probability of improvement and entropy search. Here, we choose the commonly used EI (Dewancker et al. 2015):

EI(x) = (μ(x) − f(x⁺)) Φ(z) + σ(x) φ(z),  z = (μ(x) − f(x⁺)) / σ(x)

where f(x⁺) is the current maximum value; μ(x) is the expected value of f(x); σ²(x) is the variance of f(x); Φ(z) is the cumulative distribution function; and φ(z) is the probability density function of the standard normal distribution.
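The EI acquisition can be sketched as follows, assuming a GP posterior mean μ(x) and standard deviation σ(x) are already available at the candidate points; the numeric values are purely illustrative:

```python
# Expected improvement for maximization: (mu - f+) * Phi(z) + sigma * phi(z).
# BO evaluates this at candidate hyperparameter settings and proposes the
# one with the largest EI.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    sigma = np.maximum(sigma, 1e-12)          # guard against division by zero
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.5, 0.9, 1.2])                # hypothetical posterior means
sigma = np.array([0.3, 0.1, 0.01])            # hypothetical posterior std devs
ei = expected_improvement(mu, sigma, f_best=1.0)
print(np.argmax(ei))                          # index of the proposed next point
```

Note how EI balances exploitation (high μ) against exploration (high σ): a point whose mean is below the incumbent can still earn nonzero EI through its uncertainty.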

The CDSF Framework and the Realization of PCA-LSTM-BO
Decomposition-based models and rainfall-runoff modeling have limited capacity to predict runoff series with complex nonlinearity, high irregularity, and multiscale variability. Therefore, this study develops a CDSF framework and realizes it with PCA-LSTM-BO to improve runoff forecasting accuracy. The CDSF framework contains three main stages: (1) dimensionality reduction, (2) long-term dependency learning and (3) forecasting. During the first stage, PCA is used to reduce the input dimensionality to save time and computational resources and to decrease the overfitting risk. In the second stage, the LSTM model and BO algorithm are used to learn the long-term dependency from climate data to runoff series. In the last stage, the optimized LSTM is used to predict the testing samples. The diagram of the CDSF framework and its PCA-LSTM-BO realization is illustrated in Fig. 1 and summarized as follows.
Step 1 Collect runoff time series and climate time series as inputs of the CDSF framework.
Step 2 Select the optimal lag for each time series by the partial autocorrelation function (PACF).
Step 3 Generate learning samples, including input predictors and the output target, for different lead times based on the optimal lags obtained in Step 2.
Step 4 Use PCA to reduce the input dimensionality of learning samples obtained in Step 3. The number of predictors achieved through dimensionality reduction is set from 0 to half of the number of input predictors.
Step 5 Divide the learning samples obtained in Step 4 into the testing set, validation set and training set (accounting for 20%, 20% and 60%, respectively, of the whole daily runoff samples in this study).
Step 6 Divide the learning samples obtained in Step 3 (i.e., without PCA) into the training set, validation set and testing set in the same proportions, so that model performance with and without PCA can be compared.
Step 7 Train the parameters of the PCA-LSTM-BO model with the training sets obtained in Steps 5 and 6; then, optimize the hyperparameters of the model with the corresponding validation sets and the BO algorithm.
Step 8 Input the testing sample obtained in Step 5 and Step 6 into the optimized PCA-LSTM-BO model to predict the runoff time series.
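The chronological 60%/20%/20% split used in Steps 5 and 6 can be sketched as follows; splitting by time order rather than shuffling keeps future information out of the training set:

```python
# Chronological train/validation/test split for daily runoff samples.
import numpy as np

def chrono_split(X, y, train_frac=0.6, val_frac=0.2):
    n = len(X)
    i1 = int(n * train_frac)
    i2 = int(n * (train_frac + val_frac))
    return (X[:i1], y[:i1]), (X[i1:i2], y[i1:i2]), (X[i2:], y[i2:])

X = np.arange(2922 * 2).reshape(2922, 2)   # 2922 daily records, as in the study
y = np.arange(2922)
train, val, test = chrono_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 1753 584 585
```

Any scaling or PCA fitted later should use only the first (training) block, consistent with Steps 4-7.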

Predictors and Predicted Targets
The runoff series along with 13 climate time series (see Table S1) are selected to build the LSTM model. The input predictors and output targets are first determined to generate the learning samples. The PACF is widely used to determine the optimal input lags in ARMA models and ML models (He et al. 2019). However, using lags that pass the 95% confidence test but are only marginally significant leads to high computational cost and modeling time. Therefore, we select all lags before the first insignificant lag as the optimal input. The average air pressure at the Xining station is used as an example to show how the optimal lags are determined from the PACF plot in Fig. S3. In general, any lag whose PACF value exceeds the confidence interval (light blue shaded area) is a candidate, and we choose the first several significant lags as predictors. As seen from Fig. S3, the first three lags are clearly significant, while the lags after the third are not. Although the 5th, 12th and 17th lags exceed the confidence interval, they are only marginally significant, which may be caused by noise in the data. Therefore, we select x(t), x(t−1) and x(t−2) as the optimal input lags of the average air pressure. In this way, the optimal inputs of all the time series are selected (see Tables S2 and S3). The optimal inputs of each series are merged as the final predictors of the PCA-LSTM-BO model. Additionally, the predicted targets of the PCA-LSTM-BO model are the original daily runoff data Q(t+1), Q(t+3), Q(t+5) and Q(t+7) for runoff prediction 1, 3, 5 and 7 days ahead.
First, the optimal input lags of the 14 features were determined by the PACF; these optimal lags were then combined to obtain 36 input predictors. Figure S4 shows that the Pearson correlation coefficients of many predictor pairs are larger than 0.5, indicating that these predictors are highly correlated at the Minhe station. Similar results were obtained for the Xining station. The linear correlation between features implies that the model might use redundant information: correlated climate features may lead to overfitting, and more input variables also reduce the convergence speed. PCA can transform linearly correlated features into uncorrelated features while retaining as much of the original information as possible.
Therefore, we used PCA to transform the correlated features into uncorrelated principal components (PCs) (see Fig. S5). It can be observed from Fig. S5 that the variance values of the latter PCs are very small, which implies that these PCs contribute little to runoff change and can be removed (Myronidis and Ivanova 2020). In addition, Fig. S5 shows that the variances of the PCs after the 9th are all close to 0 and that the cumulative variance ratio of the first nine PCs exceeds 99%. In general, selecting the first nine PCs is sufficient for modeling, but removing PCs carries a risk of information loss. Therefore, different dimensionality reduction scales are evaluated: the number of excluded predictors ranges from 0 to 18 (half of the total number of predictors at each station). We also estimated the optimal number of predictors by maximum likelihood estimation (MLE) (Minka 2001). In addition, to obtain the optimal model, the proposed CDSF framework selects the optimal model from the input variables with and without PCA.
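Choosing the number of PCs by cumulative explained variance and by MLE can be sketched with scikit-learn; the synthetic data below have five underlying signals (the 36 predictors mirror this study's setup, but the data are illustrative):

```python
# Two ways to pick the number of PCs: the 99% cumulative-variance rule
# and scikit-learn's MLE option (Minka 2001).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))                    # 5 true underlying signals
mix = rng.normal(size=(5, 36))
X = latent @ mix + 0.01 * rng.normal(size=(1000, 36))  # 36 correlated predictors

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
n_99 = int(np.searchsorted(cum, 0.99) + 1)             # PCs covering 99% variance
n_mle = PCA(n_components="mle").fit(X).n_components_   # Minka's MLE estimate
print(n_99, n_mle)
```

Both criteria give a candidate cut-off; the framework above still treats the exact number of retained PCs as a tunable setting, evaluated via GI(NSE).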

Sample Normalization
To improve the convergence speed of the model in BO, all the samples are normalized to [−1, 1] using max-min normalization:

y = 2(x − x_min) / (x_max − x_min) − 1

where x, y, x_max and x_min are the original, normalized, maximum and minimum values, respectively, in the original samples. To avoid using validation and testing sample information, the parameters x_max and x_min are calculated from the training samples only and are then used to normalize the training, validation and testing samples (Zuo et al. 2020b).
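The normalization step can be sketched as follows; note that x_min and x_max are fitted on the training split only, so no validation or testing information leaks into them:

```python
# Max-min normalization to [-1, 1]: y = 2 * (x - x_min) / (x_max - x_min) - 1.
import numpy as np

def fit_minmax(train):
    return train.min(axis=0), train.max(axis=0)   # parameters from training only

def scale(x, x_min, x_max):
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 25.0]])

x_min, x_max = fit_minmax(train)
print(scale(train, x_min, x_max))   # training values fall exactly in [-1, 1]
print(scale(test, x_min, x_max))    # [[-0.5, 0.5]]
```

Values outside the training range would map outside [−1, 1], which is the expected behavior when the parameters come only from the training samples.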

Evaluation Criteria
To evaluate the performance of the model, we propose a generalization index based on the NSE, denoted GI(NSE). In addition, the mean squared error (MSE) and the Nash-Sutcliffe efficiency (NSE) are applied (Marmolin 1986; Nash and Sutcliffe 1970). The mathematical formulas are as follows:

MSE = (1/T) Σ_{t=1}^{T} (x(t) − x̂(t))²
NSE = 1 − Σ_{t=1}^{T} (x(t) − x̂(t))² / Σ_{t=1}^{T} (x(t) − x̄)²
GI(NSE) = a(|NSE_train − 1| + |NSE_val − 1| + |NSE_test − 1|) + b(|NSE_train − NSE_val| + |NSE_train − NSE_test| + |NSE_val − NSE_test|)

where T is the number of samples; x(t), x̄ and x̂(t) are the recorded samples, the average of the measured samples and the predicted samples, respectively. For GI(NSE), NSE_train, NSE_val and NSE_test are the NSE values of the training set, validation set and testing set, respectively; a is the weight of the sum of the distances between each NSE value and 1; and b is the weight of the sum of the pairwise distances among NSE_train, NSE_val and NSE_test. The closer GI(NSE) is to 0, the better the generalizability of the model.

Parameter Optimization
In this study, we test the performance of LSTM by comparing the prediction results of PCA-LSTM-BO with those of PCA-SVR-BO and PCA-GBRT-BO. Table S4 shows the relevant hyperparameter settings and search ranges for each climate-driven model (Zuo et al. 2020b). The hyperparameters were tuned to minimize the MSE. Figure 2 compares the NSE of PCA-LSTM-BO with and without dimensionality reduction for 1-day-ahead streamflow forecasting. As seen from Fig. 2a, b, the overall gap between the training (or validation) NSE values and the testing NSE values of PCA-LSTM-BO with dimensionality reduction is smaller than that without dimensionality reduction, and this gap decreases as the number of excluded predictors increases. Additionally, the testing NSE values are larger than 0.93, indicating that the PCA-LSTM-BO methods forecast unseen data reasonably well. Fig. 2 also shows that most of the results with dimensionality reduction lie within the interval defined by the horizontal lines of the results without dimensionality reduction, indicating that dimensionality reduction can improve the generalization performance of the model. These results show that dimensionality reduction enhances the generalizability of PCA-LSTM-BO because PCA transforms the correlated predictors into uncorrelated predictors and reduces the overfitting risk.

Forecasting results with PCA-LSTM-BO
The GI(NSE) values of the different dimensionality-reduction scenarios for the PCA-LSTM-BO predictions at the Xining and Minhe stations 1, 3, 5, and 7 days ahead are shown in Table 1, where "0" means that no predictors are excluded but the predictors are still transformed into uncorrelated predictors (the closer GI(NSE) is to 0, the better). The bold GI(NSE) values in Table 1 represent the optimal PCA settings of the PCA-LSTM-BO models for streamflow prediction 1, 3, 5 and 7 days ahead at the two stations. The final prediction results and scatter diagrams of the optimal PCA settings for the two stations are shown in Figs. 3 and S6, respectively. The forecasted values vary with the samples and are basically consistent with the measured values, although some predictions are underestimated or overestimated, as shown in Fig. 3. The scatter diagrams show that the measured and predicted value clusters are concentrated near the ideal fit for 1-day-ahead streamflow forecasting and that the angle between the ideal fit and the linear fit is small, which implies that the forecasted values are highly consistent with the observations. However, the values of PCA-LSTM-BO become more dispersed around the linear fit with increasing lead time for 3-, 5- and 7-day-ahead prediction. Similar results can be seen in Fig. S6. These two figures also show that the prediction precision gradually decreases with increasing lead time, because as the lead time grows, the influencing factors and the contingency increase. They also show that the forecast results for the Minhe station are better than those for the Xining station. Research shows that human activities have been the dominant factor impacting streamflow change in the Huangshui River since 1986.
The impact of human activities on streamflow in this area is mainly reflected in large reservoirs, water diversion projects, and water intake and drainage, which further increase the nonstationarity of the runoff (Li et al. 2020; Su et al. 2021). This is why the prediction results for the Minhe station, located in the remote countryside, are better than those for the Xining station, located in the city center.

Contrastive Results and Analysis
The performance of SVR depends on the choice of kernel function. Therefore, to find the best-performing SVR, three different kernel functions are compared. As shown in Table S5, the GI(NSE) value of the RBF kernel is smaller than those of the sigmoid and linear kernels for streamflow prediction at the Minhe station 1, 3, 5 and 7 days ahead. This validates that the RBF is the best of these kernels, and the following SVR results are based on the RBF.
To evaluate the superiority of the PCA-LSTM-BO model, two CDSF realizations based on SVR and GBRT, namely, PCA-SVR-BO and PCA-GBRT-BO, are compared using the same dataset. In addition, an autoregressive LSTM model is also compared with these CDSF realizations. Table 2 shows the quantitative evaluation results of these optimized models. The GI(NSE) values of the PCA-LSTM-BO method for runoff prediction are lower than those of the PCA-SVR-BO, PCA-GBRT-BO and LSTM methods, illustrating that PCA-LSTM-BO outperforms the other models in generalizability.
As seen from Table S6, the optimized PCA-LSTM-BO selected by GI(NSE) has a higher NSE value than the other optimized models in most scenarios, and the models based on the CDSF framework have higher NSE values than LSTM in all scenarios. To represent the performance changes more intuitively, the performance gaps between the models are shown in Fig. 4. As shown in Fig. 4a, b, all the CDSF realizations show similar trends in predicting daily runoff 1, 3, 5 and 7 days ahead at the Xining station, but the proposed PCA-LSTM-BO has slightly higher NSE and lower MSE values than the PCA-SVR-BO and PCA-GBRT-BO methods and much higher NSE and lower MSE values than the LSTM method. As observed from Fig. 4c, d, similar results are obtained for the Minhe station, illustrating the superiority of the PCA-LSTM-BO model. Overall, the forecasting performances can be ranked as PCA-LSTM-BO > PCA-SVR-BO ≈ PCA-GBRT-BO > LSTM. Moreover, the CDSF realizations are all superior to the single LSTM method, indicating the advantages of the CDSF framework in daily runoff forecasting. In addition, with increasing lead time, the NSE (MSE) gap between the LSTM and the PCA-LSTM-BO gradually increases from 0.0179 (8.913) to 0.0622 (30.807) at the Xining station, indicating that the CDSF framework is more stable than the autoregressive LSTM model for longer lead times. Overall, the above results sufficiently illustrate that PCA-LSTM-BO has the best predictive performance. Figure S7 displays the forecasting results generated by the PCA-LSTM-BO, PCA-SVR-BO, PCA-GBRT-BO and LSTM models. As seen from Fig. S7b-d, PCA-SVR-BO and PCA-GBRT-BO have a certain ability to fit the trend and periodicity but a poor tracking ability for predicting peak runoff and capturing random variations. As seen from Fig. S7, the LSTM model is better at forecasting the periodicity but performs poorly in forecasting the peak and valley runoff.
Furthermore, the PCA-LSTM-BO method generally follows the runoff trend and has a better tracking ability for forecasting the observed peak runoff. In general, the PCA-LSTM-BO method has a better forecasting accuracy and generalizability performance than the other three models for forecasting daily runoff at the Xining and Minhe stations.

Conclusion
To avoid the decomposition errors introduced by decomposition-based models and the drawbacks of rainfall-runoff modeling in runoff forecasting, this study proposed a novel CDSF framework realized with PCA-LSTM-BO. There are four stages in implementing the CDSF framework: (1) determine the optimal lag for each variable by the PACF to form predictors; (2) reduce the input dimensionality by PCA to decrease the overfitting risk; (3) tune the hyperparameters of an LSTM model using BO; and (4) forecast the future streamflow using the optimized LSTM model. The CDSF realizations, including PCA-LSTM-BO, PCA-SVR-BO and PCA-GBRT-BO, and an autoregressive LSTM were compared for predicting daily runoff. The main conclusions are as follows.
1. PCA in the CDSF framework can improve the forecasting capacity and generalizability.
2. The CDSF framework is superior to the autoregressive LSTM models for all forecasting scenarios.
3. PCA-LSTM-BO has the best generalizability among all the CDSF realizations.
4. The GI(NSE) value is demonstrated to be effective in selecting the model with better generalizability.
The novelty of this paper is that a framework based on PCA, LSTM and BO is proposed, and the impact of climate data on daily streamflow is considered. In addition, a new criterion, i.e., GI(NSE), is proposed to select an optimal model with better generalizability. The results of this study not only demonstrate the superiority of the CDSF framework and the ability of the PCA-LSTM-BO model in streamflow forecasting but also demonstrate that GI(NSE) is effective in selecting the model with better generalizability.