Real-time Flood Classification Forecasting Based on k-means++ Clustering and Neural Network

Floods are among the most dangerous disasters that affect human beings. Timely and accurate flood forecasting can effectively reduce losses to human life and property and improve the utilization of flood resources. In this study, a real-time flood classification and prediction method (RFC-P) was constructed based on factor analysis, the k-means++ clustering algorithm, SSE, a backpropagation neural network (BPNN) and the M-EIES model. Model parameters of different flood types were obtained to forecast floods. The RFC-P method was applied to the Jingle sub-basin in Shanxi Province. The results showed that the RFC-P method can be used for the real-time classification and prediction of floods. The parameters of the flood classification and prediction model were consistent with the characteristics of the flood events. Compared with the results of unclassified predictions, the Nash coefficient increased by 5%–11.62%, the relative error of the average flood peak was reduced by 6.08%–12.7%, the relative error of the average flood volume was reduced by 5.74%–8.07%, and the time difference of the average peak was reduced by 43%–66% based on the proposed approach. The methodology proposed in this study can be used to identify extreme flood events and provide scientific support for flood classification and prediction, flood control and disaster reduction in river basins, and the efficient utilization of water resources.


Introduction
Floods have always been among the most dangerous disasters affecting human health. In addition, with the increase in global temperatures and the intensification of human activities, the frequency of flood disasters is gradually increasing, thus posing a considerable threat to human life, property, and ecosystems (Merz et al. 2021).
Many efforts have been made to develop models that improve flood prediction. These models generally fall into two categories: physically based models and data-driven models (Singh 2018). Physically based models mainly express the hydrological process in a basin through mathematical equations, and most of the model parameters have physical significance (Kauffeldt et al. 2016). Whether the parameter settings are reasonable directly determines the final forecasting result (Benke et al. 2008; Gopalan et al. 2019; Wu et al. 2021). An optimization algorithm is typically used to obtain a single set of model parameters, but these parameters are only the optimal values for a selection of historical flood events (Narsimlu et al. 2015; Reshma et al. 2018). Regardless of which optimization method is adopted, it is difficult to obtain one set of parameters that is optimal for all floods, which leads to inadequate prediction accuracy (Liang et al. 2021; Zhou et al. 2021).
Artificial intelligence approaches have developed rapidly in the past two decades, and they include methods such as machine learning, discovery techniques, and knowledge mining systems (Negnevitsky 2005; Wodecki 2019; Kliegr et al. 2021). Among them, machine learning methods, such as artificial neural networks, decision trees, and support vector machines, perform well in flood prediction, flood risk assessment and flood classification (Inyang et al. 2020; Keum et al. 2020; Mosavi et al. 2018; Munawar et al. 2021). In terms of hydrological forecasting, although machine learning methods can quickly produce predictions after training on a large amount of data and have a better forecasting effect for various types of floods than do physical models, their internal structures and mechanisms are still unclear (Parisouj et al. 2020). In view of these problems, physical models and machine learning can be combined to improve simulation accuracy (Young et al. 2017). However, machine learning is mostly used to analyze the prediction results of physical models or to reduce the corresponding error (Wan et al. 2019). In flood classification, clustering algorithms and neural network methods have been widely used to explore the generation mechanisms of floods and assess different flood events (Sikorska et al. 2015; Stein et al. 2020), but few studies have combined this approach with flood prediction.
This study aims to formulate a real-time flood classification and prediction (RFC-P) method that combines the k-means++ clustering algorithm and a backpropagation neural network (BPNN) with the M-EIES model. RFC-P integrates the strengths of factor analysis, k-means++, the sum of the squared error (SSE), a BPNN and the M-EIES model into one framework. SSE is used to determine the k value in the k-means++ clustering algorithm, and the clustering results are then used to train the BPNN. This overcomes the problem that the k-means++ clustering algorithm needs to reclassify samples and recalculate the cluster centers after each adjustment; it avoids the generation of different classification categories and event classification results, thus promoting stable real-time flood classification. The M-EIES model parameters for the corresponding flood types are obtained by an intelligent optimization algorithm, and high-precision predictions are obtained for various flood types. This study combines the advantages of machine learning and physically based models to improve the accuracy of flood prediction. The method is then applied to the Jingle sub-basin in Shanxi, China.

Case Study
The study region is the 2799 km² Jingle sub-basin (Fig. 1). The Jingle sub-basin is located in the upper reaches of the Fenhe River, a tributary of the Yellow River, Shanxi, China. This area in the middle latitudes has a semiarid and semihumid temperate continental monsoon climate, with a monthly average temperature of 4-13 °C and an average annual precipitation of 497.85 mm. The average annual maximum peak flood discharge is 594 m³/s, and the measured maximum peak flood discharge is 2230 m³/s. In addition, there are few coal mines, few water diversion projects, and no water conservancy projects in the Jingle sub-basin. In recent years, urbanization has accelerated in the Jingle sub-basin, and the urban land area has expanded annually. However, the main land use types in this region are still woodlands and arable land (Fig. 1).
We obtained the flood data (from 1971 to 2018) for Jingle Hydrological Station from the Annual Hydrological Report of the Yellow River. The daily rainfall data from 17 rain measurement stations (Fig. 1c) in the Jingle sub-basin were unified to a 1 h time step for area-averaged rainfall, and evaporation values were calculated at the same scale; the area-averaged values were obtained via the Thiessen polygon method.
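As a minimal sketch of the area-averaging step, assuming the Thiessen polygon area of each station is already known, the basin-average rainfall at each time step is the area-weighted mean of the station records. The function name, station count, areas, and rainfall values below are hypothetical and are not data from this study.

```python
import numpy as np

def thiessen_average(station_rain, polygon_areas):
    """Area-weighted basin rainfall from station records.

    station_rain: array of shape (n_steps, n_stations), rainfall in mm.
    polygon_areas: array of shape (n_stations,), Thiessen polygon areas.
    """
    rain = np.asarray(station_rain, dtype=float)
    areas = np.asarray(polygon_areas, dtype=float)
    weights = areas / areas.sum()  # each station's share of the basin area
    return rain @ weights          # one basin-average value per time step

# Hypothetical example: 3 stations, 2 hourly time steps.
rain = np.array([[2.0, 4.0, 6.0],
                 [0.0, 1.0, 3.0]])
areas = np.array([100.0, 200.0, 100.0])  # km^2, illustrative only
basin_rain = thiessen_average(rain, areas)
```

The weights depend only on station geometry, so they can be computed once and reused for every flood event.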

Methods
The RFC-P framework (Fig. 2) combines real-time flood classification based on factor analysis, SSE, the k-means++ clustering algorithm and a BPNN with the M-EIES model.

RFC Method
The RFC method mainly includes three modules: data handling and processing, cluster analysis, and flood classification. The programming language used is Python 3.7, and the libraries used for preprocessing and managing our data are NumPy and pandas.
(a) Data handling and processing
First, after considering and analyzing the factors that influence floods in the Jingle sub-basin, the characteristic indices of the flood events were established based on eight variables: the duration of precipitation $T$, the maximum 1 h precipitation $P_{1h}$, the maximum 3 h precipitation $P_{3h}$, the mean precipitation intensity $i$, the total precipitation $P$, the preceding affected precipitation $P_p$, the rising flow $Q_0$, and the precipitation center $D$. These flood event index variables were easily obtained for the early rainfall period and were significantly related to flood formation. The statistics for these indices are presented in Fig. 3a.
Second, the factor analysis method was used to reduce the dimension of the flood characteristic indices extracted from the flood process data for the basin. All the extracted characteristic indices were converted into several unrelated common factors, and the common factors for each flood were identified (Meredith 1993).
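The text does not state which factor extraction method was used. The following is a minimal sketch that assumes principal-component extraction from the correlation matrix with no rotation; the function name, the number of factors, and the random input data are illustrative assumptions only.

```python
import numpy as np

def extract_common_factors(X, n_factors):
    """Principal-component factor extraction (assumed method, no rotation).

    X: array of shape (n_events, n_indices) of flood characteristic indices.
    Returns the loadings (n_indices, n_factors) and the standardized
    factor scores (n_events, n_factors).
    """
    X = np.asarray(X, dtype=float)
    # Standardize each index so that units (mm, h, m^3/s) do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)           # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)      # index-to-factor loadings
    scores = Z @ eigvecs / np.sqrt(eigvals)    # unit-variance, uncorrelated
    return loadings, scores

# Hypothetical data: 88 events described by 8 indices.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(88, 8))
loadings, scores = extract_common_factors(X_demo, n_factors=3)
```

The resulting scores are mutually uncorrelated, which is the property the clustering step relies on.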

(b) Cluster analysis
The k-means++ algorithm was used to cluster the extracted common factors related to floods (Arthur and Vassilvitskii 2007; Likas et al. 2002; Nielsen and Nock 2013). Let $X$ denote the set of sample points and $D(x)$ the shortest distance from a point $x$ to the nearest center already chosen. The process is as follows:
(1) Take one center $c_1$ chosen uniformly at random from $X$.
(2) Take a new center $c_i$, choosing $x \in X$ with probability $D(x)^2 / \sum_{x \in X} D(x)^2$.
(3) Repeat Step (2) until $k$ centers are obtained.
(4) For each $i \in \{1, \ldots, k\}$ (Arkesteijn and Pande 2013), set the cluster $C_i$ as the set of points in $X$ that are closer to $c_i$ than they are to $c_j$ for all $j \neq i$.
(5) For each $i \in \{1, \ldots, k\}$, set $c_i$ to the center of mass of all points in $C_i$:
$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
(6) Repeat Steps (4) and (5) until $C$ no longer changes.
Obtaining a reasonable value of the flood classification number $k$ is the key to ensuring reasonable flood classification results. In this study, SSE was used to determine the optimal value of the flood classification number $k$:
$$SSE = \sum_{i=1}^{k} \sum_{x \in S_i} d(x, c_i)^2$$
where $d$ is the Euclidean distance between two sample points in the Euclidean space, $k$ is the number of categories, $x$ is a sample point, $c_i$ is the center point of cluster $i$, and $S_i$ represents the set of sample points in class $i$.
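The clustering procedure and the SSE criterion can be sketched as follows. This is a minimal NumPy illustration of k-means++ seeding followed by the standard assign-and-update iterations, not the implementation used in this study; the function names and the synthetic two-dimensional data are assumptions for demonstration.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Seed k centers: the first uniformly, the rest with probability ~ D(x)^2."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)  # squared distance to nearest center
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, rng, max_iter=100):
    """Return labels, centers, and SSE after the iterations converge."""
    X = np.asarray(X, dtype=float)
    centers = kmeans_pp_init(X, k, rng)
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)             # assign to the closest center
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # clusters no longer change
        labels = new_labels
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:                   # keep old center if cluster is empty
                centers[i] = members.mean(axis=0)  # center of mass
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse

# Synthetic data: three well-separated groups (illustrative only).
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc, 1.0, size=(30, 2))
                   for loc in ([0, 0], [10, 10], [-10, 10])])
sse_by_k = {k: kmeans(blobs, k, rng)[2] for k in range(1, 7)}
```

Plotting `sse_by_k` against k and looking for the point where the decrease levels off is one common way to read the SSE curve; the exact selection rule applied in this study is not specified in the text.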

(c) Real-time classification
A BPNN is a multilayer feed-forward neural network that consists of an input layer, a hidden layer, and an output layer, and is trained according to an error back-propagation algorithm. A BPNN is usually used as a supervised classifier (Benediktsson and Swain 1990; Heermann and Khazenie 1992). In this study, the first 80 floods in the historical flood dataset were selected as the training sample, and the eight characteristic indices of floods ($T$, $P_{1h}$, $P_{3h}$, $i$, $P$, $P_p$, $Q_0$, and $D$) were used as the inputs of the BPNN. The cluster number assigned to each flood was used as the output of the BPNN, and the remaining floods were used to verify the real-time classification performance of the BPNN.
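A minimal sketch of such a classifier is given below, assuming one tanh hidden layer, a softmax output, and full-batch gradient descent; the network size, learning rate, and synthetic data are illustrative assumptions and do not reproduce the configuration used in this study. The 80/8 split mirrors the training and verification split described above.

```python
import numpy as np

def train_bpnn(X, y, n_classes, n_hidden=10, lr=0.1, epochs=3000, seed=0):
    """Train a one-hidden-layer network with error back-propagation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot class targets
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                   # forward pass: hidden layer
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
        d_logits = (P - Y) / n                     # cross-entropy gradient
        dW2 = H.T @ d_logits
        db2 = d_logits.sum(axis=0)
        dH = (d_logits @ W2.T) * (1.0 - H ** 2)    # back-propagate through tanh
        dW1 = X.T @ dH
        db1 = dH.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, b1, W2, b2

def predict_bpnn(params, X):
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).argmax(axis=1)

# Synthetic stand-in for 88 floods with 8 indices and 3 classes (hypothetical).
rng = np.random.default_rng(42)
class_centers = rng.normal(0.0, 5.0, (3, 8))
y_all = rng.integers(0, 3, size=88)
X_all = class_centers[y_all] + rng.normal(0.0, 1.0, (88, 8))
X_all = (X_all - X_all[:80].mean(axis=0)) / X_all[:80].std(axis=0)

params = train_bpnn(X_all[:80], y_all[:80], n_classes=3)
train_acc = (predict_bpnn(params, X_all[:80]) == y_all[:80]).mean()
test_pred = predict_bpnn(params, X_all[80:])
```

Once trained, the network classifies a new event with a single forward pass, so no re-clustering is needed when a flood arrives.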

M-EIES Model
The M-EIES model was developed by our team (Hu et al. 2005;Wen et al. 2020) based on the Xinanjiang model (Zhao 1992). The model comprehensively considers the complex mechanism of the combined action of superpermeability and runoff generation and simulates the flood process by analyzing the main runoff generation modes of different rainfall processes. The model demonstrates strong applicability for flood simulations with complex underlying surface conditions.
(a) Model structure
The M-EIES model consists of five parts: evapotranspiration, watershed unit runoff, watershed unit confluence, underground runoff confluence, and watershed confluence. The M-EIES model organically combines the area distribution curve of the basin storage capacity and the area distribution curve of the basin infiltration capacity, and comprehensively considers the infiltration capacity, soil moisture content, and distribution. The distribution curve of the infiltration capacity in a given period is depicted as an m-degree parabola as follows:
$$\beta = 1 - \left(1 - \frac{F'_{\Delta t}}{F'_{m\Delta t}}\right)^{m}$$
where $\beta$ is the relative area, $F'_{\Delta t}$ is the infiltration capacity (mm) at a certain point in the basin, $m$ is the empirical exponent, $F'_{m\Delta t}$ is the maximum infiltration capacity at a point (mm) within the time period, and $\Delta t$ represents the time period.
According to the area distribution curve of the infiltration capacity, the average amount of infiltration water in the basin is $F_{m\Delta t}$ and is calculated as follows:
$$F_{m\Delta t} = \frac{F'_{m\Delta t}}{1 + m}$$
The water storage capacity distribution curve of the basin is based on an n-degree empirical parabola:
$$\alpha = 1 - \left(1 - \frac{W'}{W'_m}\right)^{n}$$
where $W'$ is the water storage capacity value at a certain point (mm), $W'_m$ is the maximum water storage capacity in the basin (mm), $n$ is the empirical exponent, and $\alpha$ is the relative area.
The average storage capacity of the basin ($W_m$) was calculated according to the area distribution curve of the basin storage capacity:
$$W_m = \frac{W'_m}{1 + n}$$
where $W_m$ is the average water storage capacity of the basin.
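The relationship between a parabolic distribution curve and the basin-average capacity can be checked numerically. The sketch below assumes the parabolic form commonly used in Xinanjiang-type models, relative area = 1 - (1 - capacity / maximum capacity)^exponent, for which the basin average equals the maximum point capacity divided by (1 + exponent); the function names and the parameter values are hypothetical.

```python
import numpy as np

def relative_area(capacity, max_capacity, exponent):
    """Fraction of the basin whose point capacity does not exceed `capacity`."""
    ratio = np.clip(np.asarray(capacity, dtype=float) / max_capacity, 0.0, 1.0)
    return 1.0 - (1.0 - ratio) ** exponent

def mean_capacity(max_capacity, exponent):
    """Basin-average capacity implied by the parabolic curve."""
    return max_capacity / (1.0 + exponent)

# Illustrative values only: maximum point storage of 120 mm, exponent 0.4.
w_point_max, n_exp = 120.0, 0.4

# Numerical check: the average capacity is the area under (1 - relative area).
w = np.linspace(0.0, w_point_max, 10001)
uncovered = 1.0 - relative_area(w, w_point_max, n_exp)
numeric_mean = np.sum((uncovered[1:] + uncovered[:-1]) / 2.0 * np.diff(w))
analytic_mean = mean_capacity(w_point_max, n_exp)
```

The same two functions apply to the infiltration capacity curve by passing the maximum point infiltration capacity and the exponent m instead.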
(b) Model parameters and calibration method
There are 17 main parameters of the model, including 4 evapotranspiration parameters, 5 flow parameters, 4 confluence parameters and 4 water source parameters (Table 1). In this study, the manual trial and error method and the SCE-UA algorithm (Song et al. 2012) were used to optimize and calibrate the model parameters.
(c) Model evaluation methods
The Nash coefficient (NSE), relative error, and time difference for the flood peak were used to evaluate the optimal values of periodic parameters and the effect of the classified flood prediction. The NSE was used to verify the credibility of the model (Nash and Sutcliffe 1970) and can be expressed as follows:
$$NSE = 1 - \frac{\sum_{m=1}^{n} \left(Q^{obs}_{m} - Q^{sim}_{m}\right)^2}{\sum_{m=1}^{n} \left(Q^{obs}_{m} - \bar{Q}^{obs}\right)^2}$$
where $Q^{obs}_{m}$ is the measured value of runoff, $Q^{sim}_{m}$ is the simulated value of runoff, $\bar{Q}^{obs}$ is the average value of measured runoff, and $n$ is the sequence length. The value of the NSE ranges from −∞ to 1; the closer it is to 1, the better the performance and the higher the reliability of the model. A value close to 0 indicates that the simulation results are close to the average observed value and the overall results are credible, but the process simulation error is large in this case; a value less than 0 indicates that the result is not credible (Moriasi 2007).
The relative error (RE) is expressed as follows:
$$RE = \frac{Q^{obs} - Q^{sim}}{Q^{obs}} \times 100\%$$
where $Q^{obs}$ and $Q^{sim}$ are the observed and simulated values, respectively. The optimal RE value is 0. A positive value means that the simulated value is less than the observed value, whereas a negative value means that the simulated value is greater than the observed value. The greater the absolute value of RE is, the greater the simulation error of the model (Gupta et al. 1999). In this study, the relative error of the flood peak discharge and the relative error of the flood volume were used to evaluate prediction performance.
There is a time lag between simulated and observed flood peaks:
$$E = t^{obs}_{m} - t^{sim}_{m}$$
where $t^{obs}_{m}$ represents the observed value of the occurrence time of the flood peak, $t^{sim}_{m}$ represents the simulated value of this time, and $E$ is the error between the two (Khu and Madsen 2005).
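The three evaluation measures can be sketched in a few lines of NumPy. The sign conventions below (observed minus simulated for both the relative error and the peak time difference) are assumptions chosen to match the statement that a positive RE means underestimation; the function names and the hydrograph values are hypothetical.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency; 1 is a perfect fit, values below 0 are poor."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def relative_error(obs_value, sim_value):
    """Relative error in percent; positive when the simulation is too low."""
    return (obs_value - sim_value) / obs_value * 100.0

def peak_time_difference(obs, sim, dt_hours=1.0):
    """Observed minus simulated peak time (assumed sign), in hours."""
    return (int(np.argmax(obs)) - int(np.argmax(sim))) * dt_hours

# Hypothetical hourly hydrographs in m^3/s.
q_obs = np.array([10.0, 50.0, 200.0, 120.0, 40.0])
q_sim = np.array([12.0, 60.0, 150.0, 180.0, 50.0])

score = nse(q_obs, q_sim)
peak_re = relative_error(q_obs.max(), q_sim.max())
volume_re = relative_error(q_obs.sum(), q_sim.sum())
peak_lag = peak_time_difference(q_obs, q_sim)
```

With a constant time step, the sum of the discharge series is proportional to the flood volume, so the volume error can be computed directly from the sums.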

Flood Classification
(a) Clustering classification
All events were classified into 5 categories by clustering (Fig. 4). Cluster 1 (C1) comprised long-rainfall floods. Cluster 2 (C2) comprised short-rainfall floods with high preceding precipitation that contributed to the event (Fig. 5). Cluster 3 (C3) was a single extreme flood with intensive rainfall and a long duration. Cluster 4 (C4) comprised short-rainfall floods. Cluster 5 (C5) comprised flash floods. The specific characteristics of the various floods are shown in Fig. 3.

(b) Real-time classification
The real-time classification results of the BPNN were consistent with the classification results of the clustering analysis, and the classification accuracy reached 100% (Table 2). The maximum error of the detection results was 0.083, and the relative error was 2.1%. Therefore, the BPNN correctly classified floods.

Classification of Flood Forecasting Results
First, the scope of parameter optimization was preliminarily set according to the physical significance of the parameters in the model. Then, combined with three automatic optimization algorithms, the parameters in each reservoir superpermeability model were optimized and adjusted individually using historical flood data and classified historical flood data, and the optimal value of each parameter was obtained (Table 3).
In C1, one event was used for parameter calibration, and the other was used for verification. In C3, the single event was parameterized and simulated. Ten flood events were randomly selected from each of the other three flood clusters, of which seven were used for parameter calibration and three were used to verify the model. The results (Figs. 6 and 7) showed that compared with traditional flood prediction with a single set of parameters, the RFC-P method using multiple parameter sets can more accurately predict the peak arrival time of total runoff and the simulated flood process.
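The random selection of ten events per cluster and their division into seven calibration and three verification events can be sketched as follows. The function name, the seed, and the event numbers are hypothetical; the text does not state how the random selection was performed.

```python
import numpy as np

def split_events(event_ids, n_select=10, n_calibrate=7, seed=0):
    """Randomly pick n_select events, then split them for calibration/verification."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(np.asarray(event_ids), size=n_select, replace=False)
    return chosen[:n_calibrate], chosen[n_calibrate:]

# Hypothetical cluster containing events numbered 1 to 30.
cluster_events = list(range(1, 31))
calibration_ids, verification_ids = split_events(cluster_events)
```

Sampling without replacement guarantees that no event is used for both calibration and verification.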
After classification, the average NSE of C1 increased by 8.8%, the average relative error for the flood peak was reduced by 7.68%, the average relative error for the flood volume was reduced by 8.07%, and the average peak occurrence time difference was reduced by 66%.
The average NSE after the classification of C2 increased by 5.6%, the average relative error of the flood peak was reduced by 9.15%, the average relative error of the flood volume was reduced by 5.74%, and the average peak occurrence time difference was reduced by 60%.
The average NSE after the classification of C3 increased by 11.6%, the average relative error of the flood peak was reduced by 12.7%, the average relative error of the flood volume was reduced by 6.82%, and the average peak occurrence time difference was reduced by 50%.
The average NSE after the classification of C4 increased by 10%, the average relative error of the flood peak was reduced by 8.94%, the average relative error of the flood volume was reduced by 6.68%, and the average peak occurrence time difference was reduced by 58%.
The average NSE after the classification of C5 increased by 5%, the average relative error of the flood peak was reduced by 6.08%, the average relative error of the flood volume was reduced by 6.58%, and the average peak occurrence time difference was reduced by 43%.
Fig. 4 Flood clustering results. 1-88 represents the number of flood events. C1, C2, C3, C4 and C5 indicate the flood type

Discussion
The results of the RFC-P approach indicate that the accuracy of all types of flood prediction improved compared to that achieved by the traditional methods, especially for extreme flood events. The main reason is that the traditional prediction method, which is based on historical flood events, uses an optimization method to calibrate a set of hydrological model parameters so that the model can adapt to most types of floods and achieve relatively good forecasting results. However, the causes of floods are complex, and the interactions among different factors vary under different climate conditions (Tarasova et al. 2019). Therefore, a set of parameters cannot reflect the characteristics of all types of floods, and it is difficult to achieve high prediction accuracy for all types of floods. Through flood classification, different flood types can be identified, and the model parameters of each flood type can be determined with an optimization method so that the model has multiple sets of parameters that reflect the characteristics of different flood processes, to achieve high-precision forecasting for various flood types. In particular, extreme flood events are usually associated with special climatic conditions, and the corresponding flood mechanisms and processes differ significantly from those for other types of floods (Merz et al. 2021). Compared with the traditional forecasting method, the RFC-P method can capture the characteristics of extreme flood events in time through rainfall indices and preliminarily determine flood types in the real-time classification stage.
Fig. 5 Classified flood event features. $P_{1h}$, the maximum 1 h precipitation; $P_{3h}$, the maximum 3 h precipitation; $i$, the mean precipitation intensity; $P$, the total precipitation; $P_p$, the preceding affected precipitation; $T$, the duration of precipitation; $Q_0$, the rising flow
Among the parameters of the RFC-P model, $W_m$ reflects the degree of drought in the early stage, $f_c$ represents the stable infiltration rate, and the soil permeability coefficient $k$ exhibits its lowest values in the extreme flood category, which differ markedly from the traditional prediction parameters. Consequently, the consistency with actual extreme flood event processes is improved, and the prediction accuracy for the time of flood peak appearance is high. Therefore, this method provides an efficient and robust tool for real-time flood prediction, especially for extreme floods. However, the RFC-P method presented in this study has some limitations. First, the RFC classification method requires sufficiently long time series of historical flood event data to reflect the characteristics of all flood types in the basin, so it is not suitable for areas with insufficient data. This problem can be addressed by establishing the relationship between the characteristics of the underlying surface in the basin and the characteristics of floods, so that data from areas with sufficient records can be interpolated to areas without data. Second, a large amount of repeated data may add several hours to the training time of the BPNN in the RFC-P method. However, if the dataset is too small, the run time may be only a few minutes, but the generalization ability of the BPNN may be poor. Therefore, a set of dataset criteria should be established to guide the use of this method. Third, in terms of factor selection, both the number and the category of classification factors affect the classification results. Restricted by the existing understanding of flood causes, the classification factors reflect only some of the characteristics of the flood process and cannot fully capture the factors that influence it or the causal mechanisms of floods.
In-depth studies of flood mechanisms and the selection of different numbers and categories of factors for classification experiments can promote the widespread use of this method and help determine the best classification factors.

Conclusions
In this study, the advantages of factor analysis, the k-means++ clustering algorithm, SSE, the BPNN and the M-EIES model are integrated into a framework to develop an RFC-P method. The application of the RFC-P method in the Jingle sub-basin showed that, compared with traditional prediction methods, the prediction accuracy for various types of floods (especially extreme floods) could be improved by classifying watershed flood events in real time and determining hydrological model parameters suitable for different types of floods. However, the method still needs to be validated in different catchments. The selection of the flood characteristic indices is the key step in this method and requires further analysis and discussion in the future. Additionally, once the relationship between the underlying surface characteristics and flood characteristics is established, the method can be extended to areas with insufficient data.