Regionalization of Rainfall Intensity-Duration-Frequency (IDF) Curves With L-Moments Method Using Neural Gas Networks

Floods are among the most frequent and destructive natural hazards, causing heavy human and financial losses and damaging houses, farms, roads, and other structures. Intensity-duration-frequency (IDF) curves are a primary practical tool in flood control studies, including the design of hydraulic structures. In many cases, however, no measuring device exists at the site of interest, or the available records are not usable, so the curves cannot be derived by conventional at-site methods. Regionalizing the IDF curves addresses this limitation. In this research, regionalized IDF curves are derived for Khozestan province, Iran, from 21 rain gauge stations using L-moments and neural gas networks. Clustering, which divides the study area and its stations into hydrologically homogeneous regions, is one of the most influential steps and a prerequisite for regional frequency analysis (RFA). In this study, clustering is performed with two new models, the neural gas (NG) and growing neural gas (GNG) networks. Comparison of the regional IDF curves with the single-site curves showed that the neural gas network models performed more accurately and efficiently, with the lowest estimation error among the models considered. In addition, given the acceptable difference between regional and single-site curves, the L-moments method was judged suitable for RFA.


Introduction
The intensity-duration-frequency (IDF) curve is one of the most common tools in water resources engineering and serves as an input to the planning, design, and operation of water resources projects. A common problem in many countries is that the meteorological station networks whose data form the main basis for IDF construction are sparse or very weak. To address this problem, regional analysis of rainfall depth and construction of regional IDF curves have been proposed.

The IDF concept dates back to the work of Bernard in 1932, and many later studies focused on improving the statistical inference methods used in IDF analysis (Bell 1969). Notable contributions in this field include the study of Hosking and Wallis (1997) on L-moments estimation, probability-weighted moments (PWM) (Greenwood et al. 1979), the parametric formulation of IDF relations (Koutsoyiannis et al. 1998), and the use of regional methods such as the Index-Flood method.

Today, IDF atlases have been built in developed countries. One example is the National Oceanic and Atmospheric Administration (NOAA) Atlas 14, which was created by the American National Weather Service (Perica et al. 2013).

Regional analysis uses the group statistics and characteristics of stations with similar behavior instead of data from a single station. Several studies applying regional methods to extreme rainfall suggest that these techniques reduce the uncertainty of estimates obtained from the at-site point of view (Lee et al. 2003). One of the main problems in extending frequency analysis results from one or more stations to a region is the hydrological heterogeneity of the region. Although cluster analysis is suitable for grouping hydrological features, it does not fully guarantee homogeneous regions, so it is recommended to check the cluster analysis results against other conventional methods (Rousseeuw 1987).

Soltani et al. (2017) used the time-scale characteristics of rainfall and three variables (average daily rainfall intensity, standard deviation of daily rainfall intensity, and scale index) to draw regional IDF curves for Khozestan province; the absolute error of the estimates was mainly below 25%, and the results were judged acceptable. Using topographic and rainfall characteristics, Alemaw et al. [...]

IDF regionalization is very beneficial because it shortens the steps and time needed for the calculations and provides results for the whole area rather than for a single station. Identifying the homogeneous regions is usually the most crucial and difficult stage of regional hydrologic frequency analysis and a prerequisite for its underlying assumptions. In this study, a method based on the neural gas and growing neural gas networks is presented to cluster the hydrological data and determine the homogeneous regions. The neural gas network is a type of competitive neural network that uses unsupervised learning. The network was first introduced by Martinetz and Schulten (1991). One feature of this algorithm is that it learns the topology, or distribution shape, of the data space. One drawback is that it starts with several elements, which makes the algorithm slow at first. This problem was solved four years later [...]

4-Determine the appropriate statistical distribution for each region and estimate the distribution parameters at the required durations.
5-Compute the quantiles, or precipitation amounts, for the required durations and return periods.
6-Draw the regional IDF curves.

Neural Gas (NG) network
The learning rule of the neural gas network is as follows:

$$\Delta w_i = \varepsilon \, h_\lambda(k_i) \, (x - w_i)$$

$$h_\lambda(k_i) = \exp(-k_i / \lambda)$$

where $w_i$ is a gas molecule (neuron) placed in the data space. The number of these molecules is initially set to an assumed value and later revised so that the algorithm performs logically and optimally. These elements are selected within the range of the main data. $h_\lambda$ is a parameter that specifies the learning rate and depends on $k_i$ and $\lambda$: if $\lambda$ tends to infinity, all neurons learn equally, and if it tends to zero, only the nearest neuron learns. Neither extreme of $\lambda$ is suitable on its own, and a value between them is usually chosen. $k_i$ is the number of neurons that are superior to (closer to the input than) neuron $i$. $\varepsilon$ is a constant that controls the learning rate.

To create a neighborhood between the first- and second-nearest neurons, an edge is created. For each pair of neurons there is $C_{ij} \in \{0, 1\}$, which indicates whether an edge (neighborhood) exists, and $t_{ij} \in \{0, 1, 2, \dots, T\}$, which records the time (age) since the edge was last created or refreshed; if the age exceeds the threshold $T$, the neighborhood is broken. This approach helps the neural network learn the topology.

The NG algorithm can be summarized as follows:
Step 1: Random positions $w_i$ are created in the data space.
Step 2: An input $x$ is selected from the data.
Step 3: Ranking, which includes computing the distance between $x$ and the centers $w_i$ and ranking each center.
Step 4: Adaptation (learning).

The main point is that the learning rate should be reduced as training progresses; otherwise the network keeps repeating itself in a faulty cycle. For this purpose, $\lambda$ and $\varepsilon$ should be decreased as learning progresses, using the following functions:

$$\lambda(t) = \lambda_i \left( \lambda_f / \lambda_i \right)^{t / t_{\max}}$$

$$\varepsilon(t) = \varepsilon_i \left( \varepsilon_f / \varepsilon_i \right)^{t / t_{\max}}$$

where the index $i$ denotes the parameter value at the beginning of learning and the index $f$ its value at the end of the learning process. For instance, if $t = 0$ then $\lambda(t) = \lambda_i$, and if $t = t_{\max}$ then $\lambda(t) = \lambda_f$.

Step 5: An edge is created between the two nearest neurons, $i_0$ and $i_1$, and its age is set to zero (a neighborhood is created).
Step 6: The age of all edges is increased ($t_{ij} \to t_{ij} + 1$).
Step 7: The age $t_{i_0 i_1}$ is taken as 0, and for each $j$ with $t_{i_0 j} > T$ the edge is removed, $C_{i_0 j} = 0$. For the reasons mentioned in step 4, $T$ should be increased during the learning period to reduce the degree of rigidity, which means the edges are allowed to last longer.
Step 8: If the termination conditions are not met (for example, the maximum number of neurons or a performance target), step 2 is repeated. Otherwise, the algorithm ends.

Growing neural gas (GNG) network
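Before the growing variant is described, the NG steps of the previous section can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names `neural_gas` and `assign`, the default parameter values, and the use of a matrix of edge ages are assumptions made only for illustration. The sketch applies the rank-based update with exponentially decaying $\lambda$ and $\varepsilon$ and a growing age threshold $T$.

```python
import numpy as np

def neural_gas(data, n_units=2, t_max=2000, eps=(0.5, 0.01),
               lam=(1.0, 0.01), T=(20, 200), seed=0):
    """Illustrative neural gas; eps, lam and T are (initial, final) pairs (assumed values)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    w = rng.uniform(lo, hi, size=(n_units, data.shape[1]))   # Step 1: random positions
    age = np.full((n_units, n_units), -1)                    # -1 means "no edge"

    def schedule(p, t):
        # p(t) = p_i * (p_f / p_i) ** (t / t_max)
        return p[0] * (p[1] / p[0]) ** (t / t_max)

    for t in range(t_max):
        x = data[rng.integers(len(data))]                    # Step 2: pick an input
        dist = np.linalg.norm(w - x, axis=1)                 # Step 3: distances ...
        order = np.argsort(dist)
        rank = np.argsort(order)                             # ... and ranks k_i
        h = np.exp(-rank / schedule(lam, t))
        w += schedule(eps, t) * h[:, None] * (x - w)         # Step 4: adaptation
        i0, i1 = order[0], order[1]
        age[age >= 0] += 1                                   # Step 6: age existing edges
        age[i0, i1] = age[i1, i0] = 0                        # Steps 5 and 7: (re)create edge
        age[age > schedule(T, t)] = -1                       # Step 7: drop old edges
    edges = [(i, j) for i in range(n_units)
             for j in range(i + 1, n_units) if age[i, j] >= 0]
    return w, edges

def assign(data, w):
    """Label each record with the index of its nearest unit."""
    d = np.linalg.norm(np.asarray(data, float)[:, None, :] - w[None, :, :], axis=2)
    return np.argmin(d, axis=1)
```

In a regionalization setting, each row of `data` would hold the normalized attributes of one station, and `assign` would give the region of each station. The number of units is fixed in advance here, which is the main difference from the growing variant.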

The GNG algorithm, which is based on unsupervised artificial neural networks, was first introduced by Fritzke (1995). The GNG network is an incremental clustering algorithm: the number of neurons increases during the learning process without prior knowledge of the structure of the input patterns (Fink et al. 2015). Unlike classical clustering algorithms, the GNG algorithm has an adaptive network structure, which makes it suitable for learning the topology of large data sets (Zaki and Yin 2008). The main idea of GNG is to add new nodes (neurons) continuously to a small initial network in a growing structure. In the GNG network, the neurons compete to determine which one is most similar to the input data (Morell et al. 2014).

The GNG algorithm can be summarized as follows:
Step 1: Create two random neurons at locations $w_1$ and $w_2$.
Step 2: Select an input vector $x$.
Step 3: Find the best neuron ($s_1$) and the second-best neuron ($s_2$).
Step 4: Increase the age of all edges connected to $s_1$: $\forall j: t_{s_1 j} \leftarrow t_{s_1 j} + 1$.
Step 5: Increase the accumulated error of $s_1$.
[...] $t_{s_1 s_2} = 0$
Step 8: All edges whose age is greater than $T$ are deleted.
Step 9: If the number of inputs presented to the network is an integer multiple of $L$, a new neuron $r$ is created at the location $w_r = (w_q + w_f)/2$, where $q$ is the index of the neuron with the largest accumulated error and $f$ is the index of the neighbor of $q$ with the largest error. The errors of $q$ and $f$ are reduced by the coefficient $\alpha$: $E_q \leftarrow \alpha E_q$, $E_f \leftarrow \alpha E_f$. The error of the new neuron is set to $E_r = E_q$.
Step 10: Decrease the accumulated error of all neurons: $E_j \leftarrow d\,E_j$, $d < 1$.
Step 11: If the stopping criterion (for example, the maximum number of neurons or a performance measure) has not yet been met, step 2 is repeated.
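The GNG steps above can be sketched in Python as follows. This is a hedged illustration and not the implementation used in this study. Steps 6 and 7 are filled in from the standard formulation of Fritzke (1995), in which the winner and its neighbors move toward the input and the edge between the two best neurons is created or refreshed. The function name `growing_neural_gas`, the parameter names (`eps_b`, `eps_n`, `alpha`, `d`), and all default values are assumptions. The removal of neurons that have lost all their edges, which is part of the standard algorithm, is omitted for brevity.

```python
import numpy as np

def growing_neural_gas(data, max_units=10, n_iter=3000, L=100, T=50,
                       eps_b=0.05, eps_n=0.006, alpha=0.5, d=0.995, seed=0):
    """Illustrative GNG; returns unit positions and the list of edges (i, j), i < j."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    w = [rng.uniform(lo, hi), rng.uniform(lo, hi)]           # Step 1: two random neurons
    err = [0.0, 0.0]                                         # accumulated errors
    age = {}                                                 # {(i, j): age} with i < j

    for n in range(1, n_iter + 1):
        x = data[rng.integers(len(data))]                    # Step 2: input vector
        dist = [float(np.linalg.norm(x - wi)) for wi in w]
        s1, s2 = (int(k) for k in np.argsort(dist)[:2])      # Step 3: best, second best
        for e in age:                                        # Step 4: age edges of s1
            if s1 in e:
                age[e] += 1
        err[s1] += dist[s1] ** 2                             # Step 5: accumulate error
        w[s1] = w[s1] + eps_b * (x - w[s1])                  # Step 6 (assumed): move s1 ...
        for e in age:
            if s1 in e:
                j = e[0] if e[1] == s1 else e[1]
                w[j] = w[j] + eps_n * (x - w[j])             # ... and its neighbors
        age[(min(s1, s2), max(s1, s2))] = 0                  # Step 7 (assumed): refresh edge
        age = {e: a for e, a in age.items() if a <= T}       # Step 8: delete old edges
        if n % L == 0 and len(w) < max_units:                # Step 9: insert a neuron
            q = int(np.argmax(err))
            nbrs = [e[0] if e[1] == q else e[1] for e in age if q in e]
            if nbrs:
                f = max(nbrs, key=lambda j: err[j])
                r = len(w)
                w.append(0.5 * (w[q] + w[f]))                # w_r = (w_q + w_f) / 2
                age.pop((min(q, f), max(q, f)))              # replace edge q-f by q-r, f-r
                age[(q, r)] = 0
                age[(f, r)] = 0
                err[q] *= alpha
                err[f] *= alpha
                err.append(err[q])                           # E_r = E_q
        err = [d * e for e in err]                           # Step 10: decay all errors
    return np.array(w), sorted(age)
```

Because the network starts from two neurons and grows every `L` inputs, the number of units does not have to be chosen exactly in advance; `max_units` plays the role of the stopping criterion in step 11.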

Discordancy and heterogeneity measures
Consider a region containing $N$ stations, where station $i$ has a record length of $n_i$ and sample L-moment ratios $t^{(i)}$, $t_3^{(i)}$, and $t_4^{(i)}$. In this case, the discordancy criterion is calculated using the relations below:

$$\bar{u} = \frac{1}{N} \sum_{i=1}^{N} u_i$$

$$S = \frac{1}{N-1} \sum_{i=1}^{N} (u_i - \bar{u})(u_i - \bar{u})^{T}$$

$$D_i = \frac{1}{3} (u_i - \bar{u})^{T} S^{-1} (u_i - \bar{u})$$

where $u_i = [t^{(i)}, t_3^{(i)}, t_4^{(i)}]^{T}$ is the vector of L-moment ratios at station $i$, $N$ is the number of stations, and $S$ is the sample covariance matrix.

If $D_i$ is large, site $i$ is discordant. An appropriate criterion is that a station is discordant if $D_i$ is greater than or equal to 3.

To calculate the degree of heterogeneity, $V_1$ is first obtained from equation 14 for the observed data:

$$V_1 = \left\{ \frac{\sum_{i=1}^{N} n_i \left( t^{(i)} - \bar{t} \right)^2}{\sum_{i=1}^{N} n_i} \right\}^{1/2} \tag{14}$$

where $n_i$ is the sample size at station $i$, $t^{(i)}$ is the sample L-moment ratio (L-CV), and $\bar{t}$ is the regional average of the sample L-CV.

$V_1$ is also calculated for each simulated region. From the simulated data, the mean $\mu_V$ and standard deviation $\sigma_V$ are determined, and the heterogeneity criterion follows from relation 16:

$$H_1 = \frac{V_1 - \mu_V}{\sigma_V} \tag{16}$$

Hosking and Wallis (1991) suggested that a region can be regarded as acceptably homogeneous if $H_i$ is smaller than 1, relatively heterogeneous if $H_i$ is between 1 and 2, and definitely heterogeneous if $H_i$ is greater than 2. In practice, the $H_1$ criterion is more appropriate (Rao and Srinivas 2006).
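As an illustration of the measures above, the following Python sketch computes sample L-moment ratios from unbiased probability-weighted moments, the discordancy measure $D_i$, and the record-length-weighted dispersion $V_1$. The function names are hypothetical, and the heterogeneity measure itself is not computed, because it requires simulating many regions from a fitted distribution, which is beyond the scope of a short example.

```python
import numpy as np

def sample_lmoment_ratios(x):
    """Return (l1, t, t3, t4): mean, L-CV, L-skewness, L-kurtosis (needs n >= 4)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    j = np.arange(1, n + 1)
    # Unbiased probability-weighted moments b0..b3
    b0 = x.mean()
    b1 = np.sum((j - 1) / (n - 1) * x) / n
    b2 = np.sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((j - 1) * (j - 2) * (j - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2 / l1, l3 / l2, l4 / l2

def discordancy(u):
    """D_i for an (N, 3) array whose rows are [t, t3, t4] of each station."""
    u = np.asarray(u, dtype=float)
    dev = u - u.mean(axis=0)
    S = dev.T @ dev / (len(u) - 1)                 # sample covariance matrix
    return np.einsum("ij,jk,ik->i", dev, np.linalg.inv(S), dev) / 3.0

def weighted_lcv_dispersion(n, t):
    """V_1: record-length-weighted standard deviation of the at-site L-CV values."""
    n = np.asarray(n, dtype=float)
    t = np.asarray(t, dtype=float)
    t_bar = np.sum(n * t) / np.sum(n)              # regional average L-CV
    return np.sqrt(np.sum(n * (t - t_bar) ** 2) / np.sum(n))
```

A station whose `discordancy` value reaches 3 would be flagged under the criterion stated above. With the covariance form used here, the $D_i$ values of a region average to $(N-1)/N$, which is a convenient sanity check on an implementation.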

Selecting the appropriate distribution
An appropriate frequency distribution for a homogeneous region can be selected by comparing the moments of candidate distributions with the regional average moments of the data. To select the best distribution, a goodness-of-fit test is also performed for each distribution function, based on the statistic $Z^{Dist}$. A distribution function is considered appropriate if $|Z^{Dist}| < 1.64$:

$$Z^{Dist} = \frac{\tau_4^{Dist} - \bar{\tau}_4 + \beta_4}{\sigma_4}$$

Here "Dist" denotes the distribution, $\tau_4^{Dist}$ is the L-kurtosis (LCK) of the distribution, $\bar{\tau}_4$ is the regional average of the sample L-kurtosis, $\beta_4$ is the regional bias of this moment, $\sigma_4$ is its regional standard deviation, and $N_{sim}$ is the number of simulated regions.

In equation 21, $X_T$ is the set of dimensionless regional quantiles with non-exceedance probability $F$, which is called the regional growth curve. $\mu_j$ is the scale factor for station $j$, for which a parameter such as the mean or median is used to simplify the calculations.

The L-moment coefficient of variation and the L-moment ratios at station $j$, estimated from the single-site data $x_j$, are taken to be equal to their regional values. As a result, the regional quantiles $X_T$ can be estimated by equating the first to fourth moments of the region with the mean, the L-moment coefficient of variation, and the L-moment ratios of the distribution function chosen for the region.

By estimating the set of quantiles $X_T$ for the annual maximum series of rainfall intensities for each duration and desired return period in a homogeneous region, together with the scale factor $\mu_j$ for only one station in the region, the different values $(i, d, T)$ can be computed using equations (20) and (21). It is therefore not necessary to fit a probability distribution function to every annual series at each station. Finally, these values are used to draw a regional IDF curve for each homogeneous region.
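The procedure of this section can be sketched in Python under stated assumptions. The example assumes a GEV regional distribution and fits it with the widely used L-moment approximation of Hosking for the shape parameter; other distributions, such as the generalized logistic, would need their own estimators. The scale factor is taken to be the at-site mean, the goodness-of-fit statistic is coded directly from its definition above, and all function names are hypothetical.

```python
import math

def z_statistic(tau4_dist, tau4_bar, beta4, sigma4):
    """Z = (tau4_dist - tau4_bar + beta4) / sigma4; acceptable if |Z| < 1.64."""
    return (tau4_dist - tau4_bar + beta4) / sigma4

def gev_from_lmoments(l1, l2, t3):
    """GEV location xi, scale alpha, shape k from L-moments (Hosking's approximation).

    Not valid for k == 0 (the Gumbel case, t3 close to 0.17), which needs a separate formula.
    """
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c
    g = math.gamma(1.0 + k)
    alpha = l2 * k / ((1.0 - 2.0 ** (-k)) * g)
    xi = l1 - alpha * (1.0 - g) / k
    return xi, alpha, k

def gev_quantile(F, xi, alpha, k):
    """Quantile with non-exceedance probability F."""
    return xi + alpha * (1.0 - (-math.log(F)) ** k) / k

def regional_growth_curve(lcv, t3, return_periods):
    """Dimensionless quantiles X_T of a region with unit mean and regional L-CV, L-skewness."""
    xi, alpha, k = gev_from_lmoments(1.0, lcv, t3)
    return {T: gev_quantile(1.0 - 1.0 / T, xi, alpha, k) for T in return_periods}

def at_site_intensities(scale_factor, growth):
    """Index-flood scaling: at-site quantile = scale factor times X_T."""
    return {T: scale_factor * q for T, q in growth.items()}
```

Repeating `regional_growth_curve` for each rainfall duration and multiplying by the at-site mean intensity of that duration gives the points of one IDF curve per return period, so a single scale factor per station and duration is the only at-site quantity required.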
To investigate the differences between the regional IDF curves which are based on the [...]

The data sets should be normalized before entering the clustering models, because they combine different types of data, such as geographical data and precipitation data, which also have different units. Based on these normalized data, potentially homogeneous regions were determined using clustering models including the new method of neural gas networks and the common models of Ward, K-means, Self-Organizing Map, and [...]

Regional homogeneity tests
To investigate the homogeneity of each region and the discordancy of its stations, the H and $D_i$ statistics, both based on L-moments, were used, respectively. These values were determined for the annual maximum rainfall intensity at the different durations and for each of the clustering models. The results for 24-hour rainfall are presented in table 3. For none of the clustering models and durations was any station discordant, and, except for a few cases, all of the regions formed by the different models were homogeneous, which indicates that the clustering models are suitably accurate.

According to the results of the goodness-of-fit test, the generalized logistic (GLOG) and GEV distributions were selected as the regional distribution functions for regions 1 and 2, respectively. To estimate the required quantiles, the parameters of the regional distribution must be calculated. For this purpose, for every clustering model, the first to fourth moments of each of the two regions are set equal to the first to fourth moments of the distribution function chosen for that region. The results for the neural gas network model (NG) are presented as an example in table 4.

To determine the best clustering model, the values of the regional curves were compared with the corresponding values of the at-site curves, and the results are presented in table 5. According to the calculated estimation errors in table 5, the neural gas network clustering model has the lowest error in both indicators, which shows the superiority of this method over the other methods. The negative MBE index for this model indicates that the values of the regional IDF curves obtained from this method are somewhat larger than the at-station values (overestimation). Considering all three indicators, the growing neural gas network can be regarded as the second most suitable model. The errors between the regional and at-station IDF curves presented in table 6 show that the estimation error increases at stations with a short record length; if such stations were removed, better performance could be expected from the clustering models. The comparison between the regional and at-site curves at the four selected stations is shown in figure 9.
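The kind of comparison reported in tables 5 and 6 can be illustrated with a short Python sketch. The exact definitions of the indices used in the tables are not reproduced in this section, so the formulas below are assumptions: MBE is taken as the mean of at-site minus regional values, which matches the statement that a negative MBE means regional overestimation, and the other two indices are a mean absolute relative difference and a coefficient of variation of the root-mean-square difference. The function name and dictionary keys are hypothetical.

```python
import numpy as np

def comparison_indices(at_site, regional):
    """Assumed goodness-of-fit indices between at-site and regional IDF values."""
    a = np.asarray(at_site, dtype=float)
    r = np.asarray(regional, dtype=float)
    diff = a - r                                   # negative when regional > at-site
    return {
        "MBE": diff.mean(),                        # mean bias error (assumed sign)
        "rel_abs_diff_pct": 100.0 * np.mean(np.abs(diff) / a),
        "CV_RMSE": np.sqrt(np.mean(diff ** 2)) / a.mean(),
    }
```

For instance, at-site intensities of 10, 20, and 30 mm/hr against regional values of 11, 22, and 33 mm/hr give an MBE of -2 mm/hr and a mean absolute relative difference of 10%, that is, a regional overestimation.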

Conclusions
In this study, two new models, the neural gas and growing neural gas networks, were presented for regionalizing IDF curves. Taking into account the longitude, latitude, average annual rainfall, altitude, and maximum 24-hour annual rainfall of each station, and using three indicators, CS, Silhouette, and Calinski-Harabasz (CH), it was determined that Khozestan province contains two separate and possibly homogeneous regions. The homogeneous regions were then formed using different clustering models. Clustering was one of the most important steps of this research because of its sensitivity and great impact on the final result. Clustering was therefore performed with six models: Ward, K-means, FCM, and Self-Organizing Map (SOM), which are among the most widely used methods, together with the two new models, the neural gas network (NG) and the growing neural gas network (GNG).

To investigate the homogeneity of the two regions and the discordancy of the stations in each region, the L-moment-based regional homogeneity and discordancy tests were applied to all six models at eleven durations. In most models and durations, the regions created by clustering showed good homogeneity. After assigning each station to one of the two regions and identifying both regions as homogeneous, the regional distribution function was determined and the regional IDFs were then derived using the L-moments method. The regional IDF curve obtained for each region was compared with the at-station IDF curves in the same region. The results showed that at all stations the regional and at-site IDF curves are highly consistent and follow the same trend. This research is the first to evaluate the efficiency of neural gas networks in regionalizing IDF curves. Among the derived regional IDFs, the curves obtained from the regions formed by the neural gas and growing neural gas network models had the highest accuracy and the best agreement with the at-station curves, which indicates the efficiency of these models for regionalization. The performance of neural gas networks can help address various issues and problems in water resources management and planning.

Acknowledgement
The authors appreciate the constructive comments of anonymous reviewers on this paper, which helped improve the final version of the paper.

Ethics approval
We confirm that this article is original research and has not been published or presented previously in any journal or conference in any language.

Consent to participate
Not applicable.

Consent for publication
All the authors consented to publish the paper.

[Figure: Lali station. Legend: At-site, Regional; 2, 5, 10, 20, 50, 100. Axis: Rainfall intensity (mm/hr)]

Table 5. Average values of goodness-of-fit indices of the difference between IDF curves based on regional and at-site probability distributions for the used clustering models

Table 6. Values for goodness-of-fit indices of the difference between IDF curves based on regional and at-site probability distributions at 21 rainfall stations for the NG clustering model (columns: Stations; values of goodness-of-fit indices ∆, MBE, CV)