An optimal integration of multiple machine learning techniques to real-time reservoir inflow forecasting

A reservoir inflow forecasting system is a crucial tool for reservoir operation and disaster prevention, particularly in areas where the primary water source derives from typhoon events. This includes the study area of the current research, the Shihmen Reservoir (Taiwan). Effectively storing short, high-intensity rainfall while avoiding disaster losses presents significant challenges in this regard. However, the high variability and uncertainty of such rainfall events make them difficult to forecast using traditional physically based models, which require too many calculations for application in real-time disaster forecasting. Accordingly, in this study, seven machine learning (ML) algorithms, including three conventional ML and four deep learning algorithms, were compared to assess their effectiveness for reservoir inflow forecasting in extreme weather events. The forecasting lead-times were set to 1, 4, and 6 h, representing short, medium, and long-term forecasting, respectively. Moreover, to ensure the stability and credibility of the models, two integrated approaches, ensemble means (EM) and the switched prediction method (SP), were also employed. The results showed that although an optimal algorithm could be selected for the short, medium, and long term, individual algorithms did not perform well in all events. Nonetheless, the integrated approaches can effectively combine the advantages of all the included algorithms and generate more accurate and stable forecasting results, particularly SP, which ranked among the top three performers in all typhoon examples and achieved the best average performance. In the short-term forecast, the RMSE of the testing events was 107.2 m³/s using SP, ranking third among all nine methods. In the medium-term forecast, the RMSE of SP was 281.72 m³/s (rank 1). In the long-term forecast, SP also performed best among the nine methods, with an RMSE of 477.14 m³/s.
In conclusion, if only a single-model forecast is considered, the gated recurrent unit, a variant of the recurrent neural network, yields the best performance. Furthermore, integrated forecasts, particularly those involving SP, can effectively improve the accuracy and stability of forecasts, making a model more applicable to actual situations.


Introduction
In recent years, climate change has received significant attention due to global warming, which has led to a higher frequency of extreme weather events. Accordingly, water resource management and disaster prevention have become vital subjects. Owing to its location and climate, Taiwan experienced an average of three to four typhoons annually from 1911 to 2019 (Wang et al. 2019). The average annual rainfall of Taiwan is 2500 mm, 2.6 times higher than the average global precipitation. Despite this abundant rainfall during typhoon events, reservoirs in Taiwan are unable to store the large volumes of water that accumulate within short rainfall durations. As a result, rather than serving as a resource, excessive rainfall may cause disasters such as flooding and landslides (Chen et al. 2021). To prevent the loss of human life and property, a sophisticated management system and skilled operation of reservoirs are necessary.
Reservoir inflow forecasts are crucial for the management and operation of water resources. This is especially true for Taiwan, a long and narrow island where reservoir management strongly affects the water supply for both daily living and industrial use. As such, an accurate reservoir inflow forecasting model is indispensable for managing the water supply systems in Taiwan. Researchers have therefore attempted to construct inflow forecasting models using physical, empirical, and conceptual methods. Young et al. (2015) used reservoir inflow forecasted by the hydrologic modeling system (HEC-HMS) as an input for artificial neural networks (ANNs) to forecast future inflow over a 6-h lead-time. Noori and Kalin (2016) used the soil and water assessment tool to simulate the base flow, surface runoff, and interflow as input factors of ANNs to forecast a 24-h lead-time inflow. Young et al. (2017) simulated the inflow through HEC-HMS and used the forecasted inflow as an input factor for the support vector machine (SVM) model to forecast a 6-h lead-time inflow. Ren et al. (2018) used ten atmospheric factors selected from JRA-55 and the snowmelt-rainfall-produced inflow simulated by the Hydrologiska Byråns Vattenbalansavdelning (HBV) model as the inputs of a Bayesian neural network; the least-squares SVM model was used to forecast monthly inflow. The above studies confirm that combining physical models and machine learning (ML) algorithms can effectively estimate inflow.
However, for extreme weather events such as typhoons, the long calculation time of physical models makes early forecasting difficult. Instead, ML models, which are better at handling nonlinear relationships and have faster calculation speeds, are widely used in hydrological forecasting. In view of this, seven ML models commonly used for processing time series (hydrological data) were selected in this study. They can be grouped into two categories: conventional ML and deep learning (DL). The conventional ML models used in this study are SVM, random forest (RF), and multi-layer perceptron (MLP), because they have been widely used in related fields (Yu et al. 2009; Kuo 2014; Mohammadi et al. 2020; Pham et al. 2020). On the other hand, with the recent rapid development of DL, a growing number of studies indicate that it can effectively deal with complex features hidden in big data (Zeiler and Fergus 2014; Szegedy et al. 2015; Krizhevsky et al. 2017).
In the field of meteorology and hydrology, Bai et al. (2016) used non-recurrent deep backpropagation networks to forecast pinpoint inflow for the Three Gorges Dam (TGD) reservoir (China). Their results showed that DL could accurately forecast inflow and effectively handle multi-dimensional problems and complex features in the engineering field. Liang et al. (2018) used long short-term memory (LSTM) to analyze water-level variations in Dongting Lake (China) and discussed the impact of TGD on the lake. Their results showed that models constructed with LSTM had noticeably higher accuracy than conventional SVM models. Zhang et al. (2018) selected LSTM and the gated recurrent unit (GRU) to forecast water levels in a sewage treatment facility. Their results indicated that both LSTM and GRU could effectively improve forecasting accuracy. In their study, GRU was somewhat more accurate than LSTM because comparatively little input data was used. Many other studies have confirmed that DL models can achieve good performance in hydrological applications (Tao et al. 2016; Song et al. 2016; Kratzert et al. 2018). Considering their performance in time series processing and the discussion in past research, deep neural network (DNN), recurrent neural network (RNN), LSTM, and GRU were selected as the DL models in this study.
In addition, in recent years, many studies have attempted to use hybrid ML models to improve the accuracy and stability of forecasting. Aghelpour et al. (2021) coupled SVM with the dragonfly algorithm (SVM-DA) to predict the Palmer Drought Severity Index in the Zagros Mountains of Iran. According to their results, the hybrid model improved accuracy by 29% compared to the original model, and other studies have reached similar conclusions (Ahmadi et al. 2021).
The present study attempts to construct a more reliable and accurate inflow forecasting model. In the first part of this study, the seven ML models mentioned previously were used to establish 1, 4, and 6-h lead-time reservoir inflow forecasting models. In the second part, two integrated methods, ensemble means (EM) and switched prediction (SP) (Lian et al. 2015), were used to integrate the results of the seven algorithms as hybrid forecasting approaches.

Methodology
In this chapter, the study area and the algorithms used in this study are described. Section 2.1 introduces the study area, Shihmen Reservoir, and the data. Conventional ML algorithms (SVM, RF, and MLP) are described in Sect. 2.2. The DL algorithms are illustrated in Sect. 2.3. The SP algorithm is illustrated in Sect. 2.4. Section 2.5 introduces the methods used to determine the optimal input combinations and hyper-parameters of the models. The performance measures are described in Sect. 2.6, and the research process is explained in Sect. 2.7.

Study area and data
Shihmen Reservoir, located in northern Taiwan, was once the largest reservoir in the Far East region. As a multiobjective hydraulic structure, the reservoir is involved in irrigation, power generation, water supply, and flood control, and even serves as a sightseeing destination. The length and total area of the reservoir are approximately 16.5 km and 8 km², respectively. The Shihmen Reservoir catchment area is 763.4 km². The total and effective reservoir capacities are 309 and 209 million m³, respectively. Based on the position of the reservoir, rainfall primarily derives from monsoons and typhoons. Figure 1 shows the study area; the blue dots represent the rainfall stations. Rainfall data from eight meteorological stations from 2004 to 2018 were collected to construct the reservoir inflow forecasting model. Because the model is intended for operational and warning use in extreme events, the training and testing datasets were selected in an event-oriented manner. Accordingly, 18 typhoons (with intact and high-quality data) were adopted, as shown in Table 1. The ratio of training to testing data usually ranges between 3:2 and 2:1. In this study, the occurrence time and severity of the events also had to be considered, and a ratio of 2:1 (i.e., 12 training and 6 testing events) was initially adopted. However, because the training dataset lacked typhoons with small peak inflow, Typhoon Meranti, which had a small inflow, was moved from the testing dataset to the training dataset to balance the severity of events in the two datasets. The ratio was therefore finally set at 13:5. The data were collected by the Water Resources Agency (WRA) in Taiwan, and the time interval of the raw data was approximately 1 h.
According to James et al. (2013), the training dataset is used to adjust the parameters of the ML models, whereas the testing dataset is used to verify model performance. In this study, due to space limitations, the follow-up results are mainly presented for the testing dataset. In addition, because of the heavy typhoon rainfall and the short time of concentration of the study area, most past studies in Taiwan used a lead-time of 6 h, or even shorter, as the standard for long-term forecasting (Chen 2019; Kao et al. 2020). However, the full set of 1 to 6-h forecasting results cannot be presented due to space limitations. Therefore, the 1, 4, and 6-h ahead performances were chosen to represent the short, medium, and long-term forecast results.

Support vector machine (SVM)
SVM was created by Vapnik (1995) to overcome the challenges of problems identified in the early 1990s. There are two main advantages of using SVM. First, the structural risk minimization adopted by the SVM can effectively reduce the loss function without increasing the structural complexity of a model, allowing it to balance accuracy and computational speed. Second, solving for the structure and weights of the SVM can be reduced to a quadratic programming problem, which can be solved using a standardized process. However, SVM still has some disadvantages that limit its usability: it is not good at handling big data, it is unable to handle complex nonlinear problems, and it is sensitive to missing data. More details about the principles of the SVM can be found in the literature (Vapnik and Cortes 1995; Cristianini and Shawe-Taylor 2000). The model construction of the SVM can be found in Huang and Hsieh (2020).

Fig. 1 The study area including eight rainfall stations
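For reference, the quadratic programming problem mentioned in the SVM description can be written out for the common ε-insensitive support vector regression formulation. This is the textbook form, shown only as an illustration; the text does not state which SVM variant or notation was used, so the symbols (w, b, C, ε, ξ_i, ξ_i^*, φ) are assumptions.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad
  & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \\
\text{s.t.}\quad
  & y_i - w^\top\phi(x_i) - b \le \varepsilon + \xi_i, \\
  & w^\top\phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\
  & \xi_i,\ \xi_i^* \ge 0, \qquad i = 1,\dots,n.
\end{aligned}
```

The first term limits model complexity and the second penalizes deviations larger than ε, which is the trade-off behind structural risk minimization.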

Random forest (RF)
Created by Breiman (2001), RF is a powerful ensemble-learning ML algorithm constructed from multiple decision trees. Based on its principles, this algorithm has several advantages. First, in contrast to conventional ML algorithms such as SVM and backpropagation networks, RF can effectively handle large numbers of input variables without the need to address dilemmas about data dependency and overfitting. Second, the randomness involved in bootstrapping makes it more robust to noise in the raw data. Furthermore, the large number of decision trees in RF makes it well suited to fitting nonlinear functions. However, when RF is used to solve regression problems, it may easily overfit, and because it cannot output continuous values, it cannot extrapolate. The construction of the RF can be found in Huang and Hsieh (2020). More details about RF can be found in Liaw and Wiener (2002) and Ahmed et al. (2019).

Multilayer perceptron (MLP)
Proposed by Rumelhart et al. (1986), the MLP is a classic backpropagation neural network constructed from three types of layers: input, hidden, and output layers. The weight correction algorithm used in MLP is backpropagation. To distinguish it from the DNN model, the activation function used in MLP does not include relatively novel functions (e.g., rectified linear unit [ReLU]), and the number of hidden layers was restricted to one. Because MLP was used for regression in this study, mean square error was adopted as the loss function, and the optimization approach was classic stochastic gradient descent. Compared with SVM and RF, MLP has several advantages, such as better nonlinear [...]. More details can be found in Haykin (2010) and Aghelpour and Varshavian (2021); the model construction can be found in Huang and Hsieh (2020).

Deep neural networks (DNN)
Hinton (2006) first proposed DNNs, applying the restricted Boltzmann machine (RBM) ANN to initialize parameters and successfully overcoming the optimization problem caused by backpropagation. In addition to the RBM, several innovations separate DNN from MLP models. First, traditional MLPs in the 1980s typically had no more than three hidden layers, whereas networks nowadays almost always have more than three hidden layers, as shown in Fig. 2. Additionally, when applied to regression problems, the ReLU function replaces the previously most common activation function, the sigmoid. Novel parameter correction algorithms, such as the Adam optimizer proposed by Kingma and Ba (2014), can significantly improve training efficiency. Dropout layers can also mitigate overfitting by randomly masking neurons while training the networks. Using these methods, networks can become deeper, broader, and more accurate. However, although increasing the number of neurons may improve model performance, it also lengthens the training time considerably. Therefore, a growing number of improved neural networks have since been proposed.
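Two of the ingredients named above, the ReLU activation and dropout, can be sketched in a few lines of plain Python. This is an illustrative sketch only: it uses the common "inverted dropout" convention (survivors are rescaled during training), which is an assumption and not necessarily the variant used in this study, and the input values are made up.

```python
import random

def relu(v):
    """Rectified linear unit: keep positive values, set the rest to zero."""
    return v if v > 0.0 else 0.0

def dropout(activations, rate, rng):
    """Inverted dropout for training: zero each activation with probability
    `rate` and rescale the survivors so the expected sum is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [relu(v) for v in [-1.5, 0.3, 2.0, -0.2]]  # hypothetical pre-activations
masked = dropout(hidden, rate=0.5, rng=rng)
```

At prediction time dropout is switched off, so the full network is used.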

Recurrent neural network (RNN)
The RNN concept was proposed by Elman (1990). To increase the correlation of existing and subsequent terms in a time series, Elman added recurrent terms that feed useful output information from the neurons in the hidden and output layers back into the network; that is, the network retains highly relevant information as input to the next round of prediction. This design makes RNN better suited to time series tasks such as natural language processing, speech recognition, and weather forecasting. However, the vanishing gradient problem makes RNN unable to memorize long-term time series information, which motivated the development of LSTM and GRU.
The memory cell of RNN is shown in Fig. 3. Here, the RNN records the information from antecedent forecasting. The inputs for RNN are typically time-series data. Accordingly, the output at time t is highly correlated with the output at time t − 1. The RNN cell merges the input at time t and the output information from forecasting at time t − 1 using the hyperbolic tangent function. More details about the RNN can be found in Sherstinsky (2020).
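As a minimal sketch of the recurrence described above, the following Python functions perform Elman-type steps for a scalar input and a scalar state. The weights `w_x`, `w_h` and the bias `b` are hypothetical illustration values, not parameters from this study.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: merge the input at time t with the information
    carried over from time t - 1 through the hyperbolic tangent."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def rnn_run(xs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Run the cell over a sequence and return the state after every step."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Hypothetical two-step input: an impulse followed by zero input.
states = rnn_run([1.0, 0.0])
```

The second state is non-zero even though the second input is zero, which shows how the cell carries information forward in time.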

Long short-term memory (LSTM)
The LSTM is an improved RNN created by Hochreiter (1997). Owing to its distinctive model construction, LSTM is more suitable for dealing with continuous information, e.g., time-series data. According to the principle of LSTM, it can forecast mid and long-term events more effectively than conventional ML algorithms and is even superior to the conventional RNN and hidden Markov model. The high accuracy of LSTM has made it one of the most commonly used DL algorithms for audio and natural language processing. Unlike the classic RNN, LSTM reforms the structure of the memory cell and incorporates three gates: the input gate, output gate, and forget gate. These gates use the sigmoid function to filter the information passed to the next element. The memory cell of LSTM is presented in Fig. 4. More details about the principles of LSTM can be found in the studies by Hochreiter (1997) and Sherstinsky (2020).
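For readers who prefer equations, the three gates can be written in the widely used textbook form below. The notation (weight matrices W and U, biases b, logistic sigmoid σ, element-wise product ⊙) is an assumption for illustration and may differ from the formulation in Fig. 4.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Because each gate lies between 0 and 1, it scales how much of the old cell state, the new candidate, and the cell output is passed on.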

Gated recurrent unit (GRU)
Cho et al. (2014) created GRU as another recurrent network that evolved from the conventional RNN. The main structure of GRU is very similar to that of LSTM, and GRU also includes a reformed memory cell. Each memory cell also has self-connected neurons and gates. The main difference between GRU and LSTM is that the former was designed with only two gates to reduce computational complexity. The memory cell construction of the GRU is shown in Fig. 5. Here, GRU combines the forget and input gates, which are separate in LSTM, into a single update gate. Similar to the forget gate in LSTM, the update gate adopts the sigmoid function to control signal strength. In addition to the update gate, GRU uses a reset gate to merge the input and antecedent information. More details about the GRU can be found in Cho et al. (2014).
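The two-gate structure can be illustrated with a scalar sketch of one GRU step. The parameter names in the dictionary `p` are hypothetical, and the state update follows one common convention (update gate weighting the previous state); some libraries swap the roles of z and 1 − z.

```python
import math

def sigmoid(v):
    """Logistic function used by both gates."""
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x_t, h_prev, p):
    """One scalar GRU step with an update gate z and a reset gate r."""
    z = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x_t + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much of the past is merged in.
    h_cand = math.tanh(p["w_h"] * x_t + p["u_h"] * (r * h_prev) + p["b_h"])
    # The update gate blends the previous state with the candidate state.
    return z * h_prev + (1.0 - z) * h_cand

# With all-zero (hypothetical) parameters both gates equal 0.5 and the
# candidate is 0, so the new state is half of the previous one.
zero_params = {k: 0.0 for k in
               ["w_z", "u_z", "b_z", "w_r", "u_r", "b_r", "w_h", "u_h", "b_h"]}
h_new = gru_step(2.0, 1.0, zero_params)
```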

Switched prediction (SP)
According to Lian et al. (2015), SP can integrate ensemble systems more effectively than EM. Wu et al. (2017) adopted an approach similar to SP to forecast precipitation at 1 to 6-h lead-times using several numerical weather prediction systems. Because of different initial parameters and model characteristics, each ensemble member may misestimate the trend of the real event. Since the main concept of EM is to directly average the results of all forecasting models, using EM to integrate an ensemble system poses a high risk of incorporating extreme errors in the results. To overcome the shortcomings of EM, SP was employed in this study to combine the seven ML algorithms introduced previously. The process comprises three steps. First, the length of the benchmark period N (in hours) was determined, and the forecasts from time t − N to t were used to calculate the performance measure. Second, a suitable performance measure P was selected to evaluate performance. In this study, because the target variable is reservoir inflow, root mean square error (RMSE) was adopted. Finally, the number of preferred models (M) was determined, and all algorithms were sorted on the basis of performance measure P. The M best-performing algorithms were then used to calculate the average inflow as the combined forecast. Using this mechanism, forecasts with extreme errors can be effectively eliminated. The forecasting results generated by SP are compared with those of EM and the optimal model, where the optimal model is the best-performing of the seven ML models. The construction of SP is presented in Fig. 6. Due to space limitations, figures of the simpler EM and optimal model are not presented herein.

Fig. 5 The memory cell of GRU
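The three SP steps, and the contrast with EM, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout, and the three toy members "A", "B", and "C" are hypothetical, and the real system combines seven models with N and M chosen by grid search.

```python
import math

def rmse(pred, obs):
    """Root mean square error over the benchmark window."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def ensemble_mean(current):
    """EM: plain average of all member forecasts."""
    return sum(current.values()) / len(current)

def switched_prediction(history, observed, current, m):
    """SP: rank members by RMSE over the last N hours, then average the
    current forecasts of the m best members.

    history  -- {model name: forecasts issued for the last N hours}
    observed -- observed inflow for the same N hours
    current  -- {model name: forecast for the target time}
    m        -- number of preferred models (M)
    """
    scores = {name: rmse(f, observed) for name, f in history.items()}
    best = sorted(scores, key=scores.get)[:m]
    return sum(current[name] for name in best) / m, best

# Toy example with N = 3 benchmark hours; member "C" is far off.
history = {"A": [100, 200, 300], "B": [110, 190, 310], "C": [300, 500, 900]}
observed = [100, 200, 300]
current = {"A": 400, "B": 420, "C": 1200}
sp_value, chosen = switched_prediction(history, observed, current, m=2)
em_value = ensemble_mean(current)
```

In this toy case SP keeps only "A" and "B", whereas EM is pulled upward by the outlying forecast from "C".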

Determination of input combinations and hyper-parameters
To construct the most reliable and accurate forecast model from the limited input factors and selected models, the combination of input factors and the hyper-parameters must be determined. The following methods were adopted for each. For the input factors, a progressive approach was used: input factors were added gradually until the performance measures no longer improved. Because of the time series relationship between rainfall and inflow data, the closer two observations are in time, the higher the correlation between them. Therefore, by incrementally adding input factors when constructing the models, the optimal model using the fewest input factors can be selected. For the hyper-parameters of the ML methods and the parameters of SP (N, P, and M), the most rigorous method, grid search, was used. The grid-search method constructs models for all combinations within a reasonable range of hyper-parameters and selects the model with the best performance measures. More details about the grid-search method can be found in Lerman (1980).

Fig. 6 The structure of SP. In the scenario of this figure, the RMSE of the 7 models from $\hat{f}_{t-N}$ to $\hat{f}_t$ is calculated. According to these RMSEs, the best M models are selected (M is determined by the grid-search method), and the average of their forecasts is taken as the output of SP
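Both selection procedures described in this section are simple enough to sketch generically. In the sketch below, `score_fn` is a hypothetical callback that would train a model and return an error measure such as RMSE; here it is replaced by toy functions so the example is self-contained.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the one with
    the lowest score (for example, RMSE)."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def forward_input_selection(max_lag, score_fn):
    """Add lagged inputs one at a time (t, t-1, ...) and stop as soon as
    the score no longer improves."""
    best_lag, best_score = 0, score_fn(0)
    for lag in range(1, max_lag + 1):
        score = score_fn(lag)
        if score >= best_score:
            break
        best_lag, best_score = lag, score
    return best_lag, best_score

# Toy stand-ins for model training: a score surface for the SP parameters
# N and M, and a list of scores indexed by the number of lags used.
sp_grid = {"N": [1, 2, 3], "M": [1, 2]}
best_sp, best_sp_score = grid_search(
    sp_grid, lambda p: (p["N"] - 2) ** 2 + (p["M"] - 1) ** 2)
toy_scores = [5.0, 4.0, 3.5, 3.6, 3.0]
best_lag, best_lag_score = forward_input_selection(4, lambda lag: toy_scores[lag])
```

Note that the progressive search stops at the first lag that fails to improve, so it returns lag 2 in the toy case even though lag 4 scores lower; grid search has no such early stop and visits every combination.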

Performance measures
To evaluate the pros and cons of the seven algorithms and two ensemble integration techniques more objectively, seven performance measures were adopted in this study: RMSE, normalized root mean squared error (NRMSE), mean absolute error (MAE), correlation coefficient (CC), coefficient of efficiency (CE), error of peak inflow (EQ_P), and error of time to peak (ET_P). The RMSE evaluates the difference between the forecasted and observed values. Because it sums the squares of all differences between observations and forecasts, the RMSE places comparatively more emphasis on errors in extreme values such as the peak. The NRMSE normalizes the RMSE so that the indicator is not affected by the data range. The MAE represents the error between predictions and real values; unlike the RMSE, it weights all predicted values equally, without a tendency to focus on any particular segment. The CC illustrates the relevance between the forecasted and observed values. The closer the CC is to 1, the better the model's performance. The CE represents the degree to which the forecasts produced by a model are more accurate than simply using the average of the observations. Similar to the CC, the closer the value is to 1, the better the model performs. The detailed calculation of the RMSE, MAE, CC, and CE can be found in Wang et al. (2019). For NRMSE, please refer to Liu and Hwang (2015) and Aghelpour and Varshavian (2020). In addition, EQ_P and ET_P are used to evaluate a model's forecasting effectiveness regarding peak values. They are calculated using Eqs. (1) and (2). The EQ_P indicates whether the forecasted peak inflow is close to the observed value. The closer the value is to 0%, the better the model's performance. Finally, ET_P is the error between the timing of the peak inflow forecasted by a model and the observed timing.
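The measures above can be sketched in plain Python. RMSE, MAE, CC, and CE follow their standard definitions. The peak measures are written under assumed definitions, because Eqs. (1) and (2) are not reproduced in this text: EQ_P is taken as the relative peak error in percent and ET_P as the forecasted peak time minus the observed peak time, in time steps. NRMSE is omitted because its normalization varies between references.

```python
import math

def rmse(obs, pred):
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)

def cc(obs, pred):
    """Pearson correlation coefficient."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)

def ce(obs, pred):
    """Coefficient of efficiency: 1 minus the squared error relative to
    the squared error of always forecasting the observed mean."""
    mo = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / sum((o - mo) ** 2 for o in obs)

def eqp(obs, pred):
    """Assumed definition: relative error of peak inflow, in percent."""
    return (max(pred) - max(obs)) / max(obs) * 100.0

def etp(obs, pred):
    """Assumed definition: forecasted minus observed peak time, in steps."""
    return pred.index(max(pred)) - obs.index(max(obs))

# Hypothetical hourly inflow: the forecast peaks one step late and too low.
obs = [1.0, 2.0, 4.0, 2.0, 1.0]
pred = [1.0, 2.0, 2.0, 3.0, 1.0]
```

Under these sign conventions, a negative EQ_P means the peak was underestimated and a positive ET_P means the forecasted peak arrived late.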

Research process
The research process is shown in Fig. 7. The data noted in Sect. 2.1 were used to construct seven ML models. Because the rainfall data from eight rainfall stations were too broad in scope to determine the input combinations and may have led to training difficulties, the rainfall data of the eight stations were merged into a mean value using Thiessen's polygon method. A comparison of each model was carried out in three phases. In the first phase, the differences between the conventional ML and DL algorithms were compared. Second, the performances of various networks with recursive terms were compared. Finally, the pros and cons of all the models were compared. In the second stage, the seven models were merged into ensemble forecasts using EM and SP. The seven models, SP, and EM were then ranked, and the optimal forecasted model was proposed.
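Merging station records with Thiessen's polygon method amounts to an area-weighted average, where each station's weight is the share of the catchment lying inside its polygon. The sketch below assumes the polygon areas are already known; the station names, rainfall values, and areas are hypothetical.

```python
def thiessen_mean(rainfall, areas):
    """Area-weighted mean rainfall.

    rainfall -- {station: rainfall depth for one time step (mm)}
    areas    -- {station: area of that station's Thiessen polygon (km^2)}
    """
    total_area = sum(areas.values())
    return sum(rainfall[s] * areas[s] for s in areas) / total_area

# Two hypothetical stations; S2 represents three times the area of S1.
rain_now = {"S1": 10.0, "S2": 20.0}
polygon_areas = {"S1": 1.0, "S2": 3.0}
mean_rain = thiessen_mean(rain_now, polygon_areas)
```

Applying the function at every time step turns the eight station series into the single mean rainfall series used as model input.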

Results
As mentioned in Sect. 2, seven ML algorithms and two ensemble forecasting methods were used in this study. The results and discussion are presented in the following order. Section 3.1 lists the optimal parameters and input combinations of each model determined by the grid-search method. The methods used to evaluate the performance of models are also introduced in this section. Comparisons of the performance of each algorithm are presented in Sect. 3.2, which is divided into several segments. First, the MLP is compared with the relatively novel DL algorithm (DNN herein). Second, several algorithms that apply recursive techniques are analyzed to derive the pros and cons of each recurrent unit. Finally, the performances of all seven models are compared, and the best single algorithm under different forecasting conditions is presented. This best single model is then compared with the integrated techniques, EM and SP, which merge the ensemble forecasts into a single result. Additionally, the confidence interval is calculated for the results of all models using reliability analysis. The results of this analysis are presented in Sect. 3.3.

Optimal inputs and hyper-parameters
As previously mentioned, the optimization algorithm used in this study was the grid-search method, which took into account all the hyper-parameter combinations within a reasonable range. Rainfall and reservoir inflow information were the input factors that had to be determined for reservoir inflow forecasting. In this study, the best input combinations selected for reservoir inflow forecasting were the same for all models. Because of the runoff concentration time for the study area, the range of reasonable inputs includes rainfall and inflow information from the current time back to a 4-h lag, i.e., the information from t to t − 4.
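The input window from t to t − 4 can be turned into training samples as sketched below. The function name, the ordering of the features, and the toy series are assumptions for illustration; the actual input combinations are those listed in the tables.

```python
def build_samples(rain, inflow, max_lag, lead):
    """Pair lagged inputs with the inflow `lead` hours ahead.

    Inputs for time t: rainfall at t, t-1, ..., t-max_lag followed by
    inflow at t, t-1, ..., t-max_lag. Target: inflow at t + lead.
    """
    samples = []
    for t in range(max_lag, len(inflow) - lead):
        x = ([rain[t - k] for k in range(max_lag + 1)]
             + [inflow[t - k] for k in range(max_lag + 1)])
        samples.append((x, inflow[t + lead]))
    return samples

# Hypothetical hourly series of length 10, 4-h input window, 1-h lead-time.
rain = list(range(10))
inflow = [10 * i for i in range(10)]
samples = build_samples(rain, inflow, max_lag=4, lead=1)
```

Changing `lead` to 4 or 6 yields the targets for the medium and long-term models while the input window stays the same.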
The optimal hyper-parameters and input combinations are listed in Tables 2 and 3, respectively. Table 2 shows the optimal combinations of the SVM, RF, and MLP.

Comparison of conventional machine learning and deep learning
According to the records in this study, the computation time of the seven algorithms constructed for on-line forecasting is less than 2 min. Therefore, this study focuses instead on the accuracy of the constructed models. This study first compares conventional ML and DL, represented by MLP and DNN, respectively, in 1, 4, and 6-h lead-time inflow forecasts for Shihmen Reservoir. Figure 8 shows the inflow hydrographs forecasted by MLP and DNN. Under the 1-h lead-time condition, both MLP and DNN accurately forecasted the inflow characteristics across the rising limb, peak segment, and falling limb, which correspond to the periods when the water level is rising, at its maximum, and falling, respectively. For the peak value, the DNN forecast showed less overestimation than the MLP and more closely matched the observed inflow; the lag time generated by the DNN was also shorter than that generated by the MLP. In the 4-h lead-time case, the inflow hydrographs forecasted by MLP and DNN shared similar trends, particularly for low inflow, where predictions were accurate. For peak-value forecasting, both the MLP and DNN tended to underestimate the peak value and generated a degree of lag time. Notably, regardless of lead-time conditions, the MLP tended to show a longer lag time than the DNN. In the 6-h lead-time scenario, the DNN underestimated the peak value to a larger degree than the MLP, whereas the MLP also showed unstable forecasts in segments other than the peak. Overall, in short lead-time forecasting conditions, both the MLP and DNN obtained relatively good forecasting results. In most cases, however, the DNN obtained more stable and accurate forecasting results than the MLP for Shihmen Reservoir under extreme events.

Comparison of recursive networks
In this segment, the RNN, LSTM, and GRU (which involved recursive techniques) are evaluated and analyzed.
The results forecasted by the three algorithms are shown in Fig. 9. Figure 9a shows the forecasted inflow under 1-h lead-time forecasting. The hydrographs generated by all algorithms accurately forecasted the inflow at each stage of the typhoon in question. However, for peak-value forecasting, the three algorithms yielded different degrees of overestimation; from most to least severe, the order is RNN, LSTM, and GRU. The lag times for the peak value generated by all the models were essentially the same. In general, the three algorithms could forecast results accurately in the 1-h lead-time scenario. For the 4-h lead-time scenario, as Fig. 9b shows, the three algorithms still simulated the trend of the inflow time series. In contrast to the 1-h lead-time scenario, however, all three algorithms tended to underestimate the peak value and generated a longer lag time. Before reaching the forecasted peak value, the inflow hydrographs forecasted by the three algorithms were approximately the same. However, the peak forecasted by the RNN occurred later than those forecasted by the two other algorithms and had a greater error relative to the observed inflow. The results of long-term forecasts under the 6-h lead-time condition are shown in Fig. 9c. All three forecasted time series showed a longer lag time than in the 1 and 4-h lead-time forecasts. Additionally, as in the 4-h lead-time case, the three algorithms tended to underestimate the peak value. Here, the severity of underestimation, ranked from high to low, is RNN, GRU, and LSTM. The hydrographs forecasted by LSTM and GRU showed similar trends. However, the inflow hydrograph forecasted by RNN could not effectively capture the characteristics of the observed inflow: its twin peaks and bumpy rising limb demonstrate that its forecasting ability is not as good as that of LSTM and GRU in long-term forecasting scenarios.
In summary, for short lead-time forecasting scenarios, all the networks with recursive techniques could effectively forecast the inflow and did not generate severe estimation errors or lag times. However, as the lead-time became longer, LSTM and GRU showed better stability and generated more accurate forecasts than RNN. Overall, LSTM and GRU showed advantages when a single algorithm is used to forecast the inflow in Shihmen Reservoir. On the other hand, according to Fig. 9c, the hydrograph forecasted by RNN also retained advantages, such as in the rising limb in the 6-h lead-time case. Prior to reaching the first peak forecasted by RNN, the RNN curve was noticeably more accurate than those predicted by LSTM and GRU. These advantages will subsequently be exploited by the SP method in Sect. 3.3.

Comparison of all models
In this section, all algorithms are compared to identify their pros and cons. The inflow hydrographs forecasted by all algorithms are shown in Fig. 10. Figure 10a shows the 1-h lead-time inflow hydrographs forecasted by all the models. Except for SVM, which tended to underestimate the peak value, the algorithms show slight overestimation. The underestimation by SVM may be related to its kernel function, a radial basis function, which can fit all curve trends well but can underestimate the peak value when the forecasted event has the largest flow among the selected events.
The differences between algorithms were observed primarily when the lead-time became longer than 3 h. Figure 10b shows the 4-h lead-time forecasts for all seven algorithms. The forecasts for the low inflow segment by all algorithms were similar to each other and matched the observed values. Regarding rising limbs, all seven algorithms tended to underestimate the inflow relative to the observed value. Conversely, for falling limbs, all seven algorithms tended to overestimate the inflow because of the average 1.5-h lag time, which shifted the forecasted time series to the right. The analysis of peak flow showed that, except for RF, the algorithms were inclined toward varying degrees of underestimation. According to Fig. 10b, the order of severity regarding underestimation is as follows: SVM, GRU, LSTM, RNN, MLP, and DNN. The lag time generated by each algorithm can be arranged from short to long as follows: LSTM was similar to GRU and SVM, but shorter than DNN, MLP, and RNN. In contrast to the other algorithms, RF tended to overestimate the peak value and produced a lag time similar to that of DNN. The hydrographs of inflow forecasted for a 6-h lead time are shown in Fig. 10c. The 6-h lead-time forecasts were less accurate than the 1 and 4-h lead-time forecasting results. Underestimation is more serious in this context; it can be sorted from severe to minor as GRU, SVM, RNN, DNN, LSTM, MLP, and RF. Furthermore, from short to long, the lag time can be ranked as RNN, MLP, RF, DNN, LSTM, GRU, and SVM. In particular, although the RNN produced the shortest lag time, the curve forecasted by this algorithm had double peaks, which was observably different from the actual situation, indicating that it may be overly sensitive or insufficiently robust for use in extreme events.
The performance measures of all algorithms in the testing phase are shown in Table 4. The performance measures of each model are the average of the forecasts for all typhoons. Across a total of 21 conditions (seven indicators and three forecast lead times), GRU achieved the best performance in nine cases, with improvements of 7.63% and 6.4% in the RMSE and MAE, respectively, compared with the second-ranked algorithms. In addition, SVM achieved the best performance in four other conditions (the best performance after GRU). GRU could outperform conventional ML models (MLP, RF and SVM) owing to its complicated architecture and its handling of time series through memory cell operations. The memory cells are able to filter and memorize data features, which helps to construct the nonlinear rainfall-runoff correlation. Although the above results indicate that GRU performs well in terms of average performance measures, the indicators and hydrographs of each particular event suggest that it may not show the best stability in all events. Based on the analysis in Sect. 3.2, other algorithms have some advantages over GRU (e.g., peak-value forecasting). Accordingly, in Sect. 3.3, two methods that are used to combine various algorithms are proposed, and the ability to draw on the advantages of combined algorithms is explored.
Fig. 10 Model comparison of all models using Typhoon Soudelor with (a) 1-h lead-time, (b) 4-h lead-time, and (c) 6-h lead-time
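For readers who wish to reproduce this kind of comparison, four of the performance measures named in this paper (RMSE, MAE, CC, and CE) can be computed as sketched below. This is an illustrative sketch under stated assumptions: CC is taken to be the Pearson correlation coefficient and CE the Nash-Sutcliffe coefficient of efficiency, which are common definitions in hydrology but are not confirmed by this section; the function names and sample values are hypothetical.

```python
import math


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)


def cc(obs, sim):
    """Correlation coefficient (assumed here to be Pearson's r)."""
    mo = sum(obs) / len(obs)
    ms = sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def ce(obs, sim):
    """Coefficient of efficiency (assumed here to be Nash-Sutcliffe)."""
    mo = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mo) ** 2 for o in obs)
    return 1.0 - residual / spread


# Hypothetical observed and forecasted values (not data from the study).
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.0, 2.0, 3.0, 5.0]
scores = {"RMSE": rmse(obs, sim), "MAE": mae(obs, sim),
          "CC": cc(obs, sim), "CE": ce(obs, sim)}
```

RMSE and MAE are error measures (lower is better), whereas CC and CE are trend-oriented measures (closer to 1 is better), which is why an algorithm can rank differently depending on the indicator.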

Performance of switched prediction
As mentioned in Sect. 3.2, according to the performance measures, although GRU and SVM may have performed better in most cases, the remaining algorithms still showed advantages in particular contexts. It would be unreasonable to select the best algorithm among the seven and ignore the advantages of the remaining six when the aim is to generate stable and reliable forecasts. Therefore, in this section, two methods (EM and SP) are employed to integrate the seven models.
As in the sections comparing algorithms, owing to limited space, the 1, 4, and 6-h lead-time forecasts were selected herein to represent short, medium, and long-term inflow forecasts. To render the analysis representative, the hydrographs presented herein cover the typhoons with the three largest inflows during the test sessions, i.e., typhoons Soudelor, Megi, and Maria. Table 5 shows the best parameter combinations used in SP after calibration, where N represents the length of forecasts used to evaluate the ranking of algorithms, P is the performance measure used for ranking, and M is the number of top-ranked algorithms whose forecasts are averaged. The 1-h lead-time forecasting hydrographs are shown in Fig. 11. The red and blue lines represent the forecasts integrated by SP and EM. The gray lines and areas represent the seven algorithms presented in Sect. 3.2 and their calculated 95% confidence interval. The best parameter combination determined for the SP under a 1-h lead-time forecasting condition was N4M4, where N4 denotes that forecasts from t to t-4 were used to calculate the performance measures and determine which algorithms performed better under such conditions. Likewise, M4 indicates that SP would select the top four algorithms and calculate their average as the forecasting value. On the other hand, the EM was employed to directly average the results of all algorithms. As Fig. 11 shows, with a 1-h lead-time, both EM and SP were able to obtain fairly accurate forecasts for the three typhoons, particularly regarding the low inflow values. For peak forecasting, both methods exhibited slight estimation errors of a similar extent. Therefore, the individual performance of the two integrated methods would be similar in a 1-h lead-time forecasting case. Figure 12 shows the 4-h lead-time forecasting hydrographs obtained from the EM and SP. The results of the integrated SP were significantly more accurate than the results generated by the integrated EM.
During the low water-level period, in the case of rising and falling limbs, both methods tended to overestimate the inflow. However, unlike the unstable EM, which alternated between overestimation and underestimation, the results forecasted by SP were comparatively more stable and controllable. The comparison of the forecasted peaks for the three events clearly indicated that SP could effectively improve the accuracy of peak forecasting and reduce lag time.
The results of the long-term forecasting, i.e., 6-h lead-time forecasts, are shown in Fig. 13. Under these conditions, the combined forecasts generated by the two methods showed longer lag times than the previously mentioned 1 and 4-h lead-time forecasts; this was because the results of the original seven algorithms each incorporated a certain degree of lag time. For typhoons Soudelor and Megi, SP provided more advantages than EM, both in peak forecasting and in the prediction of other elements. In the case of Typhoon Maria, however, because most algorithms were unable to accurately forecast the inflow, neither SP nor EM could effectively generate improved integration results.
The RMSE values and rankings of the seven algorithms, as well as of the EM and SP forecasts, are listed in Tables 6, 7, and 8 for the 1, 4, and 6-h lead-time forecasts, respectively. As shown in the column headings, Tables 6, 7, and 8 list the average training RMSE obtained from 13 training typhoons, as well as the RMSE for the three test events with the largest observed inflow (typhoons Soudelor, Megi, and Maria). Under 1-h lead-time forecasting conditions, DNN achieved the smallest RMSE value in two test events, as well as the smallest average among all the events in the test sessions. However, during Typhoon Megi, DNN dropped from being the most accurate model to sixth position. The reason could be the insufficient stability of relying on a single algorithm. Conversely, the forecasts integrated by SP maintained the second and third positions in the three test events, and their average performance and stability were better than those of EM, which held the third and fourth positions. Concerning the 4-h lead-time scenario, in terms of RMSE, no single algorithm could achieve a stable and outstanding performance in the three test events. In terms of average performance, GRU achieved second place but also showed instability in its forecast for Typhoon Maria. In contrast, the integrated forecasts generated by SP remained stable, ranking in the top two among all methods, and even achieved the best average forecasts over the test sessions in the long lead-time forecasting scenarios. Finally, for long-term 6-h lead-time forecasts, although SP ranked at [...]. Overall, the integration of multiple algorithms using SP can effectively merge the advantages of all algorithms and enable them to exert their respective advantages in various situations.
Based on the results in the above figures and tables, SP has a higher degree of stability and accuracy than either a single-algorithm forecast or EM integration. Hence, in this study, it is recommended that SP be used with various ML algorithms for ensemble forecasting to improve forecasting accuracy and enhance its practical use.

Discussion
In this study, rules for selecting the input combinations and the hyper-parameters of ML models under typhoon events are proposed. In addition, the benefits of using integrated methods to generate ensemble forecasts are verified.
In the first part, based on Table 2, the conclusion can be drawn that forecasts under short lead-time conditions may rely more heavily on the early rainfall information. Conversely, long lead-time forecasting (e.g., 4 and 6-h cases) focuses more on runoff information. This phenomenon is due to the short time of concentration (about 5 h) in this study area. Table 3 shows the optimal combinations for DL algorithms. Based on the number of hidden layers and the neurons in each, we hypothesized that RNN, LSTM, and GRU (networks that adopted the recursive technique) would be prone to exhibiting a more complex architecture when processing relatively longer lead-times. In contrast, DNN tended to use fewer hidden layers and neurons with longer lead-times. A possible rationale for these patterns is that the specific recursive technique included in the three types of RNN, as well as evolutionary networks, can effectively improve the solution-space fitting performance under complex fitting conditions. However, the over-complicated structure of the conventional DNN may give rise to significant biases.
In the second part, the integrated methods were indeed found to have higher accuracy and stability. In terms of EM, the assumption is that all models have their own advantages and errors in each scenario. These differences can be found in Sect. 3.2.3. Therefore, the EM, which simply averages the results of all models, mainly serves to avoid large errors in the forecast results. The SP method, in contrast, considers the time series characteristics of reservoir inflow, which is highly time dependent: the models that performed better in the previous time steps may also perform better later. SP calculates the RMSE between the forecasted and measured values and then selects the outperforming models, whose forecasts are averaged to produce the final forecast. According to the results, this time series assumption is indeed useful. SP can perform better than any single model and the EM integrated model. It is worth mentioning that the selection of RMSE is based on the grid-search method and the particularity of typhoon events. The forecasting system during typhoon events is concerned with the arrival time of the peak and the flood volume; if applied to other scenarios, it is recommended to use other indicators such as MAE.
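The two integration strategies described above can be illustrated with a short sketch. It is a simplified reading of the description in this paper, not the authors' implementation: the function names, the model labels "A", "B", and "C", and the sample values are hypothetical; the ranking window is assumed to be the n time steps immediately before the target step; and the fallback to a plain average when fewer than n past steps are available is an assumption introduced here.

```python
import math


def window_rmse(obs, sim):
    """RMSE over a short evaluation window."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def ensemble_mean(forecasts, t):
    """EM: average the forecasts of all models at index t."""
    return sum(series[t] for series in forecasts.values()) / len(forecasts)


def switched_prediction(forecasts, observed, t, n, m):
    """SP sketch: average the m models with the lowest RMSE over the n steps before t."""
    if t < n:
        # Assumption: without enough history to rank models, use the plain average.
        return ensemble_mean(forecasts, t)
    scores = {
        name: window_rmse(observed[t - n:t], series[t - n:t])
        for name, series in forecasts.items()
    }
    # Keep the m best-ranked models (lowest recent RMSE) and average them.
    best = sorted(scores, key=scores.get)[:m]
    return sum(forecasts[name][t] for name in best) / len(best)


# Hypothetical forecasts from three models and the observations available so far.
forecasts = {
    "A": [10.0, 20.0, 30.0, 40.0],
    "B": [12.0, 22.0, 32.0, 42.0],
    "C": [20.0, 30.0, 40.0, 50.0],
}
observed = [10.0, 20.0, 30.0]  # the value at index 3 is not yet known
em_value = ensemble_mean(forecasts, t=3)
sp_value = switched_prediction(forecasts, observed, t=3, n=3, m=2)
```

In this illustrative case, model C has been consistently biased over the recent window, so SP excludes it while EM still carries its bias into the average. In practice, n, m, and the ranking measure would be calibrated, for example by the grid search mentioned above.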

Conclusion
The main objective of this study was to compare the effectiveness of various commonly used ML methods and novel DL models for developing reservoir inflow forecasts. Two methods were adopted to integrate the seven algorithms for integrated forecasting, which would enable the subsequent model to forecast inflow more stably and accurately in various scenarios.
The results showed that the seven algorithms have different pros and cons in the case of different typhoons and lead-time conditions. In the comparison of conventional ANNs and DNNs, based on the performance measures, DNNs were more accurate than ANNs in most scenarios. However, the hydrographs still indicated that under extended lead-time forecasting conditions, the underestimation of DNNs is more serious than that of ANNs. For the comparison of recursive-based networks, GRU achieved better performance according to trend-oriented performance measures (CC and CE). Conversely, in RMSE, MAE and NRMSE, which can be used to evaluate the performance of algorithms in terms of extreme value and average error, RNN delivered better performance in the case of long lead-time forecasting. This indicates that a more complicated algorithm (GRU and LSTM) does not necessarily have more advantages; each algorithm still shows distinct advantages and disadvantages depending on the application conditions. When comparing all the algorithms under the same conditions, the best results were obtained by GRU in most of the performance measures. In a small number of performance measures and situations, RNN, DNN, and SVM showed better performance, although the improvements involved were not notable. However, although GRU showed an advantage in most performance measures, the indicators and hydrographs of each particular event suggest that it may not show the best stability in all events. Hence, integrated methods, i.e., EM and SP, were used to address these problems.
Whether using EM or SP, the performance of integrated forecasting was generally better than that of a single algorithm, indicating the stability and practicality of integrated forecasting in real applications. In particular, SP ranked among the top three in all the events included in this study, and even achieved the best performance on average among all events in the long lead-time forecasts, surpassing the best single-algorithm performance in both the 4 and 6-h lead-time forecasts.
In conclusion, seven ML algorithms were used to forecast the reservoir inflows under extreme events. The results showed that although a comparatively best algorithm on average could be established, this did not mean that it could be used stably in all scenarios. In contrast, the results obtained using the integrated methods were better ranked both in the test typhoon events and in average performance, and these approaches could be more stably applied in practical applications. They also showed higher credibility, indicating that the model could effectively assist in the operation of water resources restoration and disaster prevention.

Software information
The SVM, RF, and MLP models were created using the Scikit-learn (v. 0.22.1) software package in Python. The DL algorithms, DNN, RNN, LSTM, and GRU, are presented in the "Model comparisons" section. These models were created using TensorFlow (v.2.1) and Keras (Google).

Appendix B
Comparison between measured and forecasting inflow
Fig. A1 presents the scatter plots between observed and forecasted inflows. As the figure shows, although each model shows uncertainty under different lead-times, the overall accuracy of the DL models is higher than that of the conventional ML models, especially for the GRU and LSTM.