An Integrated Framework of GRU Based on Improved Whale Optimization Algorithm for Flood Prediction

Accurate flood prediction is the first step in formulating flood control strategies and reducing flood disasters. This research proposes a deep learning model for flood prediction based on the Gated Recurrent Unit (GRU), the Random Forest (RF) algorithm, the Whale Optimization Algorithm (WOA) and Optimal Variational Mode Decomposition (OVMD). First, the historical flow time series is decomposed using OVMD. Second, features with high importance, as measured by RF, are selected to obtain the optimal input set. Third, a GRU model is used to predict each sub-mode, with the WOA algorithm optimizing the hyperparameters of the GRU model. This study also proposes a hybrid strategy that improves the traditional WOA algorithm and enhances its optimization ability. Finally, the predictions of all sub-modes are aggregated to generate the final prediction result. The model was validated using data from three hydrological stations in the upper, middle and lower reaches of the Minjiang River basin in China. The experimental results show that the proposed model can effectively predict flood time series and achieves better accuracy than the comparison models.


Introduction
Floods are among the most common and destructive natural disasters, causing irreversible damage to the environment, human life and infrastructure (Akbari et al. 2020). Since 2000, more than 1.6 billion people worldwide have been affected by floods (Allahbakhshian-Farsani et al. 2020). Deaths caused by floods account for about 84% of all deaths caused by natural disasters worldwide (Jamali et al. 2020). Flood disasters are more dangerous than other natural catastrophes (Chapi et al. 2017), and their impact is expected to keep increasing in the future (Furquim et al. 2016). Given this danger, research on flood forecasting has become very important, and scholars and researchers have achieved excellent results in recent years. Flood prediction models can be divided into physical models and data-driven models (Masrur Ahmed et al. 2021).
Physical models simulate the internal physical mechanisms of the runoff process. They are suitable for basins with a large number of physical parameters, but they usually need massive amounts of data for calibration and validation (Huo et al. 2020; Hussein et al. 2019; Partington et al. 2012). Physical models typically require large amounts of hydrological and meteorological data, complex mathematics and accurate knowledge of the physical processes of runoff formation (Xie et al. 2019). Many limiting factors arise in their application, which introduces uncertainty into the model (Yoon et al. 2011).
Data-driven models are built with machine learning algorithms and statistics and do not require explicit physical relationships within the runoff process: they infer the relationship between input and output variables from historical runoff data (Deo and Şahin 2015). They can be considered black-box models and are very useful in natural-system modeling (Yaseen et al. 2016). Many nonlinear models have been used for hydrological prediction as machine learning has developed. Nikoo et al. (2016) proposed an artificial neural network (ANN) method for flood prediction and trained the ANN weights with a sociality-based algorithm; the model's predictions are more accurate than those of statistical models. Tehrany et al. (2015) proposed a flood prediction model that integrates a support vector machine (SVM) with the frequency ratio and verified its effectiveness against a decision tree model. Bui et al. (2019) proposed a PSO-ELM model for the spatial prediction of flash floods, in which an extreme learning machine (ELM) generates the initial flood model and PSO optimizes it; compared with SVM, ANN and DT, PSO-ELM shows better performance. Zhou et al. (2019) proposed a flood prediction model based on a fuzzy inference system embedded with a genetic algorithm and a least squares estimator, and verified its superiority against two structures of the adaptive network-based fuzzy inference system (ANFIS).
In recent years, deep learning has made great progress in many fields. It has a deeper nonlinear network structure than traditional machine learning methods. Recurrent Neural Networks (RNN) can quickly learn the mapping between input and output sequences using contextual information. However, as the network and the time span grow, gradients propagated through the hidden layers gradually decay or blow up, causing vanishing and exploding gradients (Ni et al. 2020). Long short-term memory (LSTM) and gated recurrent unit (GRU) networks are variants of RNN proposed to overcome these shortcomings. Their basic idea is to give the recurrent network the ability to control the accumulation of its internal information through gated units, which can selectively forget information to prevent overload during learning.
Although GRU and LSTM perform very similarly on many tasks, we chose GRU for our experiments: the GRU model has fewer parameters and a simpler structure, so it learns faster, and its training time is shorter than that of the LSTM model (Stergiou and Karakasidis 2021).
Many scholars have applied GRU to hydrological prediction or built related models with satisfactory results (Ayzel and Heistermann 2021; Pan et al. 2020; Wang et al. 2020a). Some use optimization algorithms to tune certain GRU parameters and improve its performance. This study introduces the Whale Optimization Algorithm (WOA) to enhance the predictive ability of the GRU model, as WOA has a simpler structure, fewer internal parameters and stronger search capability than other optimization algorithms (Du et al. 2021).
However, the WOA algorithm easily falls into local optima (Sun and Zhang 2018). This research improves the traditional WOA algorithm to reduce the impact of this problem.
Flood flow data are usually noisy and non-stationary, and using them directly for prediction reduces accuracy. To improve prediction accuracy, signal decomposition techniques have become popular in the hydrological field, such as the Wavelet Transform (Shoaib et al. 2014; Wu et al. 2021), Empirical Mode Decomposition (EMD) (Huang et al. 2014) and Variational Mode Decomposition (VMD). However, EMD suffers from end effects and mode mixing when decomposing a signal, and VMD decomposes signals better than EMD (Fu et al. 2019). Unlike traditional recursive techniques, VMD non-recursively converts a non-stationary signal into a series of orthogonal intrinsic mode function (IMF) sequences with their own center frequencies. There are already many applications of VMD to hydrological time series (Fu et al. 2018; Xie et al. 2019; Zuo et al. 2020). To better apply VMD to time series decomposition, this study uses the optimized variational mode decomposition (OVMD) method proposed by Zhang et al. (2017).
Section 2 introduces the theoretical background of this article in detail, including OVMD, RF, WOA, improved WOA (IWOA) and the proposed OVMD-RF-IWOA-GRU model; Section 3 gives a case study, which specifically includes the research area, data processing, determination of input variables and parameter settings; Section 4 compares the prediction results of different models; Section 5 gives the conclusion of this research.

Optimal Variational Mode Decomposition
Variational Mode Decomposition (VMD) is a new non-recursive adaptive decomposition processing method proposed by Dragomiretskiy and Zosso (2014), which is used to replace empirical mode decomposition or empirical wavelet transform.
The purpose of VMD is to decompose a real-valued input signal $f$ into a set of band-limited sub-signals, namely intrinsic mode functions (IMFs), which can reproduce the original input signal according to its sparsity characteristics (Mohanty et al. 2020).
For a signal $f(t)$, the corresponding constrained variational problem is

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t),$$

which is solved through the augmented Lagrangian

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle,$$

where $K$ is the total number of decomposed modes, $\alpha$ is the quadratic penalty term used to ensure the accuracy of the reconstruction, $\langle \cdot,\cdot \rangle$ is the inner product operation, and $\lambda(t)$ is the Lagrangian multiplier.
The OVMD decomposition in this paper follows the method proposed by Zhang et al. (2017): the mode number $K$ and the updating parameter $\tau$ of OVMD are determined from the mode center frequencies and a residual evaluation index.

Random Forest
Random forest (RF) is an ensemble learning algorithm based on the bagging idea, proposed by Breiman (2001). RF can effectively analyze high-dimensional nonlinear data, has strong generalization ability and good predictive performance, is not prone to overfitting, is relatively simple to build, and classifies well (Hidayat and Astsauri 2021).
RF uses bootstrap resampling to randomly draw data from the original sample set to construct multiple training sets, and builds a decision tree for each of them with randomly split nodes. The final prediction is the average of the outputs of all decision trees in the forest. The training set of each decision tree consists of N samples drawn with replacement from the entire training sample set; this process is called bootstrap resampling, and the resulting set is called the bootstrap set. The samples not included in a tree's bootstrap set are called its out-of-bag (OOB) data, and feature importance can be calculated from these OOB data.
Because RF can compute the importance of individual feature variables, this paper uses RF to perform feature selection on the sample data. The detailed steps for computing the importance of feature $x_i$ are as follows:
(1) Initialize RF with k = 1, build a decision tree $T_k$ on the kth bootstrap sample set, and record the corresponding OOB data set as $L_k^{oob}$;
(2) Classify $L_k^{oob}$ with $T_k$ and count the number of correct classifications, denoted $R_k^{oob}$;
(3) Randomly permute the values of feature $x_i$ in $L_k^{oob}$, record the perturbed OOB data set as $L_{k,i}^{oob}$, classify it with $T_k$, and count the correct classifications $R_{k,i}^{oob}$;
(4) Repeat steps (1)-(3) for k = 2, 3, ..., K, where K is the number of decision trees;
(5) Calculate the importance of feature $x_i$ as

$$impot_i = \frac{1}{K} \sum_{k=1}^{K} \left( R_k^{oob} - R_{k,i}^{oob} \right),$$

where $impot_i$ is the importance of feature $x_i$ and $K$ is the number of decision trees. Sorting the importance values in descending order gives the feature ranking: the higher the ranking, the more important the feature.
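The OOB permutation-importance procedure above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: `trees`, `oob_sets` and the simple per-row `predict` interface are hypothetical stand-ins for a fitted random forest.

```python
import random

def oob_permutation_importance(trees, oob_sets, X, y, feature, seed=0):
    """Estimate one feature's importance as the average drop in OOB accuracy
    after randomly permuting that feature's values (steps (1)-(5) above).
    `trees` is a list of fitted classifiers with a .predict(row) method;
    `oob_sets` gives, per tree, the indices of its out-of-bag samples."""
    rng = random.Random(seed)
    drops = []
    for tree, oob in zip(trees, oob_sets):
        # R_k^oob: correct classifications on the untouched OOB set
        correct = sum(tree.predict(X[i]) == y[i] for i in oob)
        # permute the chosen feature's values among the OOB rows
        perm = list(oob)
        rng.shuffle(perm)
        correct_perm = 0
        for i, j in zip(oob, perm):
            row = list(X[i])
            row[feature] = X[j][feature]   # perturbed value of x_i
            correct_perm += tree.predict(row) == y[i]
        drops.append((correct - correct_perm) / len(oob))
    return sum(drops) / len(drops)         # average over the K trees
```

A feature the trees never split on yields a drop of zero, which is exactly the behavior the ranking in step (5) exploits.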

Gated Recurrent Unit Networks
Gated Recurrent Unit (GRU) was proposed by Cho et al. (2014) as an improvement of the LSTM network. The basic unit of the LSTM hidden layer is a memory block composed of an input gate, a forget gate and an output gate, which together control how the historical information is updated: the forget gate controls how much information from the previous moment is retained, the input gate regulates the current input, and the output gate regulates the current output.

GRU has the same function as LSTM but a simpler structure: it merges the forget gate and input gate of LSTM into an update gate and uses a reset gate in place of the output gate, without adding any additional internal state. The update gate chooses how much state information from the previous moment to retain, while the reset gate determines how much past information to forget. Through this structure, the information from the previous module is selectively forgotten and added, and the new state and output of the current module are obtained by calculation. The weights in these structures are adjusted continuously through learning until the output meets the requirements, achieving the goals of screening information, controlling the data flow, deciding whether information is transmitted, and simulating the reasoning process. The calculation process of the GRU network is as follows:

(1) Reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
(2) Update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
(3) Candidate hidden state: $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$
(4) Hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.
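The four gate equations can be checked with a minimal NumPy forward pass. The weight shapes (acting on the concatenation of the previous hidden state and the input) and the parameter names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def gru_cell(x_t, h_prev, Wr, Wz, Wh, br, bz, bh):
    """One GRU step implementing the gate equations in the text.
    Wr, Wz, Wh act on the concatenation [h_prev, x_t]."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ hx + br)                                       # reset gate
    z_t = sigmoid(Wz @ hx + bz)                                       # update gate
    h_cand = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]) + bh)   # candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_cand                           # new hidden state
    return h_t
```

With all weights and biases at zero, both gates equal 0.5 and the candidate state is 0, so the new hidden state is half the previous one, which is a quick sanity check on the equations.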

Standard whale optimization algorithm
The Whale Optimization Algorithm (WOA) is a meta-heuristic optimization algorithm proposed by Mirjalili and Lewis (2016). WOA reaches the optimization goal by simulating the unique bubble-net foraging strategy of humpback whales, and the mathematical model is established by imitating this foraging behavior. To reduce the number of control variables in the algorithm, the velocity vector is removed and only the position vector is retained, which enhances the search capability of WOA.
In the whale optimization algorithm, there are three strategies for updating whale positions: encircling prey, the bubble-net attacking method, and searching for prey. Assuming the whale population size is N, the position of whale i in the D-dimensional search space can be expressed as $X_i = (x_{i1}, \ldots, x_{iD})$, with $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, D$. The three strategies are described as follows:

(1) Encircling prey:

$$D = \left| C \cdot X^*(t) - X(t) \right|, \qquad X(t+1) = X^*(t) - A \cdot D,$$

where $t$ is the number of the current iteration, $X^*(t)$ is the current best whale position vector, $X(t)$ is the current position vector of the whale, and $A$ and $C$ are coefficient vectors defined as

$$A = 2a \cdot r - a, \qquad C = 2r,$$

where $r$ is a random number in the interval [0, 1] and $a$ is the convergence factor, which decreases linearly from 2 to 0 over the iterations.

(2) Bubble-net attacking method. Humpback whales hunt by spiral motion; the distance between the whale group and the prey is calculated, and the spiral swimming behavior of the humpback whale is imitated to establish the spiral position-updating model

$$X(t+1) = D_p \cdot e^{bl} \cdot \cos(2\pi l) + X^*(t), \qquad D_p = \left| X^*(t) - X(t) \right|,$$

where $D_p$ is the distance between prey and whale, $b$ is the logarithmic spiral coefficient, and $l$ is a random number in the interval [−1, 1].

It is worth noting that while the whale swims toward its prey in a spiral, it also shrinks its encircling circle. This behavior is modeled as a 50% probability of choosing the shrinking-encirclement mechanism and a 50% probability of choosing the spiral model to update the whale position:

$$X(t+1) = \begin{cases} X^*(t) - A \cdot D, & p < 0.5, \\ D_p \cdot e^{bl} \cdot \cos(2\pi l) + X^*(t), & p \ge 0.5, \end{cases}$$

where $p$ is a random number in [0, 1].

(3) Search for prey. Whales use $|A|$ to switch between searching for prey and encircling it. When $|A| > 1$, the whale cannot obtain useful information about the prey and keeps trying to find it using random cues:

$$D = \left| C \cdot X_{rand} - X(t) \right|, \qquad X(t+1) = X_{rand} - A \cdot D,$$

where $X_{rand}$ is the position vector of a randomly selected whale in the population.
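The three position-update strategies combine into the standard WOA loop. The sketch below minimizes a generic fitness function; the population size, iteration count and spiral constant b = 1 are illustrative choices, not values from the paper.

```python
import math, random

def woa(fitness, dim, bounds, n_whales=20, max_iter=200, seed=42):
    """Minimal standard WOA (encircle / spiral / random search), minimizing
    `fitness` over a box. A and C are drawn per whale for simplicity."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    best = min(X, key=fitness)[:]
    best_f = fitness(best)
    b = 1.0                                   # logarithmic spiral coefficient
    for t in range(max_iter):
        a = 2.0 * (1 - t / max_iter)          # linear convergence factor 2 -> 0
        for w in X:
            A = 2 * a * rng.random() - a
            C = 2 * rng.random()
            p = rng.random()
            if p < 0.5:
                # |A| < 1: encircle the best; |A| >= 1: follow a random whale
                ref = best if abs(A) < 1 else X[rng.randrange(n_whales)]
                for j in range(dim):
                    D = abs(C * ref[j] - w[j])
                    w[j] = ref[j] - A * D
            else:
                # spiral bubble-net attack around the best position
                l = rng.uniform(-1.0, 1.0)
                for j in range(dim):
                    Dp = abs(best[j] - w[j])
                    w[j] = Dp * math.exp(b * l) * math.cos(2 * math.pi * l) + best[j]
            for j in range(dim):
                w[j] = min(max(w[j], lo), hi)  # clamp to the search box
            fw = fitness(w)
            if fw < best_f:
                best, best_f = w[:], fw
    return best, best_f
```

On a smooth unimodal function such as the sphere, this loop converges close to the optimum as a shrinks and the whales cluster around the best position.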

Improvement of whale optimization algorithm (IWOA)
The standard whale optimization algorithm suffers from slow convergence, low convergence accuracy and a tendency to fall into local optima, so this study improves it with three strategies:
(1) Chaotic population initialization. The traditional WOA algorithm generates its initial population randomly, which can make the population unevenly distributed and slow down the optimization. Chaotic motion is random, regular and ergodic (Paul et al. 2020); exploiting these properties yields an initial population with better diversity and improves the global search ability.
The Tent map is defined as

$$x_{k+1} = \begin{cases} x_k / u, & 0 \le x_k < u, \\ (1 - x_k)/(1 - u), & u \le x_k \le 1. \end{cases}$$

When $u = 0.5$, the generated sequence is approximately uniformly distributed and insensitive to parameter changes. This most typical Tent map is

$$x_{k+1} = \begin{cases} 2x_k, & 0 \le x_k < 0.5, \\ 2(1 - x_k), & 0.5 \le x_k \le 1. \end{cases}$$

The Tent map is used to generate the initial population for the WOA algorithm in this study.
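A minimal sketch of Tent-map initialization for the u = 0.5 case above; the seed x0 and the bounds are arbitrary illustrative choices. Note that in floating point, very long Tent-map sequences can collapse toward 0, so this sketch is intended for modest population sizes.

```python
def tent_map_population(n, dim, lo, hi, x0=0.37):
    """Generate an n-by-dim initial population in [lo, hi] by iterating the
    tent map x_{k+1} = 2*x_k if x_k < 0.5 else 2*(1 - x_k), then rescaling
    each chaotic value from [0, 1] onto the search interval."""
    pop, x = [], x0
    for _ in range(n):
        ind = []
        for _ in range(dim):
            x = 2 * x if x < 0.5 else 2 * (1 - x)
            ind.append(lo + x * (hi - lo))
        pop.append(ind)
    return pop
```

The resulting individuals spread more evenly over the box than a handful of pseudo-random draws, which is the diversity benefit the text describes.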
(2) Nonlinear convergence factor and inertia weight. The parameter A in the standard WOA balances the algorithm's global exploration and local exploitation, and both abilities depend on the convergence factor a. In the standard algorithm, a decreases linearly during the iterations, which can prevent the search speed from adapting to the actual search process. In this paper, the convergence factor a is made nonlinear, as shown in Eq. (17).
As the number of iterations increases, the convergence factor first decays slowly, allowing the algorithm to optimize globally, and then decays faster, corresponding to precise local optimization in the later stage. These different decay rates of a correspond to different search behaviors of WOA and effectively improve its exploitation capability.
where t is the current number of iterations, and max_iter is the maximum number of iterations.
In the iterative process, the standard WOA algorithm does not account for possible differences in how strongly the prey guides the whales' position updates. Therefore, following the idea of inertia-weight-guided optimization in the PSO algorithm, adaptive inertia weights are introduced into the position updates of Eq. (12) ~ (14). Dong et al. (2021) and Yan et al. (2020) both proposed nonlinear decreasing inertia weights that improved the search ability of WOA. A large initial weight leads to a wider search range, while a small weight in later iterations confines the search to a local region.
When the whale is close to the target, a smaller weight is used when moving toward the best whale, which facilitates fast global search followed by fine local search. This study proposes a new nonlinear decreasing inertia weight and applies it to Eq. (12) ~ (14) of WOA, where t is the current number of iterations and max_iter is the maximum number of iterations. As t grows, the inertia weight gradually decreases, realizing a dynamic change of the weight so that the global search and local exploitation of WOA are better balanced.
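Since the exact formulas of Eq. (17) and the proposed inertia weight are not reproduced in this text, the sketch below uses assumed quadratic schedules that only reproduce the qualitative behavior described: a slow-then-fast decay of the convergence factor and a nonlinear decreasing inertia weight.

```python
def convergence_factor(t, max_iter):
    """Assumed stand-in for the paper's Eq. (17): decays from 2 to 0,
    slowly at first and faster later in the run."""
    return 2.0 * (1.0 - (t / max_iter) ** 2)

def inertia_weight(t, max_iter, w_max=0.9, w_min=0.4):
    """Assumed nonlinear decreasing inertia weight (w_max/w_min are
    illustrative): large early for global search, small late for fine
    local search."""
    return w_min + (w_max - w_min) * (1.0 - (t / max_iter) ** 2)
```

Mid-run, the quadratic schedule stays above the linear one, which is exactly the "slower early decay" property the text attributes to the improved convergence factor.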
(3) Gaussian perturbation. A Gaussian perturbation is a random disturbance following a Gaussian distribution centered on a random variable. In each iteration, the current global best individual is perturbed, which increases the probability of finding a better solution. Gaussian mutation originates from the Gaussian normal distribution and has proved useful in many algorithm improvements, such as the grey wolf optimizer (Peng and Zhou 2019), the artificial bee colony algorithm (Xiao et al. 2021), ant colony optimization (Tuani et al. 2020) and PSO (Wang et al. 2020b), and it has also been applied to WOA (Luo et al. 2019). It is more likely to produce new offspring near the parent, so the current optimal individual updates its position in small steps; this makes the WOA algorithm converge faster and enhances its search ability (Luo et al. 2019). The probability density function of the Gaussian distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right). \tag{20}$$

Eq. (20) shows that the larger $\sigma$ is, the smaller the density at a given position x, meaning a flatter curve and a more dispersed distribution; conversely, the smaller $\sigma$ is, the steeper the curve and the more concentrated the distribution. Based on these characteristics, $\mu$ is set to 0 in this paper, so the peak of the distribution curve lies at 0. In the later iterations of the algorithm, applying a Gaussian perturbation to the current optimal individual allows its neighborhood to be explored and improves the search ability of WOA. If $\sigma$ is too large, the mutation amplitude is excessive and the perturbation easily makes the algorithm deviate, so that the global optimum becomes hard to find.
If $\sigma$ is too small, the disturbance to the search individual is insufficient to free it from local extrema and cannot increase the diversity of the search population. Therefore, $\sigma$ is set to 1 in this paper, which gives the best effect.
When  = 0 and  = 1, the density function is as follows: The Gaussian perturbation formula for the optimal individual in WOA is as follows: where (0,1) noemrnd represents a random number that generates a Gaussian distribution with a mean of 0 and a variance of 1.

Training procedures of GRU using IWOA
The basic idea of training GRU with IWOA is as follows: the IWOA algorithm optimizes the learning rate and the number of hidden-layer units of the GRU model, taking the GRU training error as the individual fitness value and selecting the optimal GRU model parameters. The specific process is:
Step 1: Collect flood flow data, preprocess the original data, eliminate outliers, and fill in missing values with cubic spline interpolation.
Step 2: Use RF to measure the importance of the decomposed data, and select the data with high importance as input variables of the GRU. The entire data set is normalized to [0, 1] and divided into a training set and a test set.
Step 3: Initialize the GRU model parameters, including the batch size, the number of training epochs and the dropout value.
Step 4: Initialize IWOA: set the population size N, the maximum number of iterations max_iter and the search range. The initial population is generated by chaotic mapping within the feasible region.
Step 5: Calculate and sort the fitness values to find the global optimal solution. According to Eq. (21), apply a Gaussian perturbation to the global optimum to improve its position.
Step 6: According to the values of A and p, use the corresponding formulas to update the positions of the next generation of whales.
Step 7: t = t + 1. If t < max_iter, go to Step 5; otherwise go to Step 8.
Step 8: Assign the obtained optimal parameters to the GRU model and make predictions on the test data set.

The mode numbers determined for each station are listed in Table 2. For the Gaochang site, similar frequency components begin to appear when K = 7, that is, over-decomposition occurs. Therefore, the total mode numbers are KGC = 6 for the Gaochang site, KZJG = 7 for the Zhenjiangguan site and KDJY = 6 for the Dujiangyan site. In general, the update parameter is initialized to 0, so that the Lagrangian multiplier remains at 0. The changes of the VMD decomposition residual evaluation index (REI) of the three sites are shown in Fig. 3; the smaller the residual error of the decomposition, the better the prediction accuracy. The REI of the Gaochang site is minimized at τ = 0.95, that of the Zhenjiangguan site at τ = 0.57, and that of the Dujiangyan site at τ = 0.96.

Determination of the input variables
Before using the hybrid model to predict flood flow, the input matrix needs to be determined.

The optimizer chosen for GRU training is the Adam optimization function. The learning rate of the GRU is set to 0.002, the number of hidden-layer nodes to 100, and the batch size to 36; in other words, the model updates its parameters after processing 36 sets of data, which makes the gradient-descent direction more accurate and keeps the training fluctuations relatively stable. To prevent overfitting, the dropout parameter is set to 0.2, so that 20% of randomly chosen neuron nodes in each hidden layer are disabled. When the optimization algorithm is used to tune the learning rate and the number of hidden-layer nodes of the GRU model, the search ranges are set to [0.001, 0.02] for the learning rate and [1, 500] for the number of hidden-layer nodes.

Performance metrics
This paper selects four commonly used indicators to verify the effect of the proposed combined forecasting model: root mean square error (RMSE), correlation coefficient (R), mean absolute error (MAE) and mean absolute percentage error (MAPE). Their calculation formulas are

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}, \qquad MAE = \frac{1}{N}\sum_{i=1}^{N}\left| y_i - \hat{y}_i \right|,$$

$$MAPE = \frac{100\%}{N}\sum_{i=1}^{N}\left| \frac{y_i - \hat{y}_i}{y_i} \right|, \qquad R = \frac{\sum_{i=1}^{N}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2 \sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}},$$

where $y_i$ and $\hat{y}_i$ are the observed and predicted values, $\bar{y}$ and $\bar{\hat{y}}$ represent the averages of the observed and predicted data, and N is the length of the time series.
For the three indicators RMSE, MAE and MAPE, the closer the value of the indicator is to 0, the closer the predicted value is to the observed value. The closer the value of R is to 1, the closer the predicted value is to the observed value.
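The four metrics can be computed directly from their definitions; a minimal pure-Python sketch (MAPE expressed in percent):

```python
import math

def metrics(obs, pred):
    """Return RMSE, R, MAE and MAPE for observed and predicted series."""
    n = len(obs)
    mo = sum(obs) / n                      # mean of observations
    mp = sum(pred) / n                     # mean of predictions
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)
    mae = sum(abs(o - p) for o, p in zip(obs, pred)) / n
    mape = 100.0 / n * sum(abs((o - p) / o) for o, p in zip(obs, pred))
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    r = cov / math.sqrt(sum((o - mo) ** 2 for o in obs)
                        * sum((p - mp) ** 2 for p in pred))
    return rmse, r, mae, mape
```

A perfect prediction yields RMSE = MAE = MAPE = 0 and R = 1, matching the interpretation given above; note that MAPE is undefined when an observed value is zero.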

Comparisons results
This section compares the proposed method with single prediction models and different combined prediction models. A total of 10 prediction models are used to predict the flood flow at the three stations, all on the same data sets.

Comparison of predicted results of the Gaochang site
This section first compares and analyzes the results predicted by the ten models.
To show the error metrics more intuitively, the RMSE, R, MAE and MAPE values of the ten models on the three data sets are illustrated with horizontal histograms and radar charts, respectively. Because Gaochang, Dujiangyan and Zhenjiangguan are located in the lower, middle and upper reaches of the Minjiang River, respectively, the tables show that the errors are largest for the Gaochang data set, followed by the Dujiangyan data set, and smallest for the Zhenjiangguan data set. The prediction results of the ten models are shown in Table 3.

Table 5 and Table 6 show the RMSE, R, MAE, and MAPE values of the other two sites, respectively.
The corresponding horizontal histograms and radar charts are shown in Fig. 12 and Fig. 14, respectively, and Fig. 13 and Fig. 15 compare the OVMD-RF-IWOA-GRU model with the ELM, GRU, WOA-GRU and RF-WOA-GRU models for the two sites. From Table 4-5 and Fig. 12-15, the same conclusions as in section 4.1 can be drawn:
(1) The OVMD-RF-IWOA-GRU model obtains the smallest MAE, MAPE and RMSE and the largest R, which proves that its predictive ability is superior to that of the comparison models.
(2) The GRU model is better than the ELM and LSTM models, which proves the effectiveness of the GRU network in processing hydrological time series data.
(3) The GRU models optimized by the PSO and WOA algorithms outperform the plain GRU model, and WOA-GRU outperforms PSO-GRU, which proves that optimization algorithms can improve the predictive ability of the GRU model and that the WOA algorithm suits GRU better than the PSO algorithm.
(4) PCA and RF are both used to perform feature selection on the input data to improve the predictive ability of the model, and RF performs better. In general, the proposed OVMD-RF-IWOA-GRU model can be regarded as a competitive technique for improving the accuracy of flood prediction. Since flood forecasting is affected by many factors, including sudden ones, the limited training data used in this article limit the achievable prediction accuracy; future work should incorporate more relevant influencing factors.