Hydrologic Similarity Based on Width Function and Hypsometry: An Unsupervised Learning Approach

In ungauged or data-scarce watersheds, systematic analyses of a 6 set of proximate watersheds (for example, selected based on locational prox- 7 imity or similarity in climate, morphometry, lithology, soils, and vegetation) 8 have been shown to lend signiﬁcant insights regarding hydrologic response and 9 prediction. Current approaches often rely on: (a) statistical regression mod- 10 els that use measurable watershed attributes, such as area, slope, and stream 11 length; and (b) comparative hydrology that considers watershed characteris- 12 tics to assess hydrologic similarity to select analogous gauged watersheds as 13 proxies. Newer conceptions regarding hydrologic similarity focus on hydrologic 14 response and therefore emphasize the use of dynamical measures of the stream 15 network and watershed terrain. For example, the width function and hypso- 16 metric curve can be readily estimated using the available global digital terrain 17 datasets and represented as functional forms involving a small set of parame- 18 ters, thus achieving signiﬁcant data reduction. In this study, a new approach to 19 hydrological similarity in watersheds, one that utilizes these functional forms 20 to identify dynamically similar watersheds, is presented. Dissimilarity matri- 21 ces are created based on divergence measures, and watersheds are classiﬁed 22 using hierarchical clustering. The joint analysis of watershed width functions 23 and hypsometric curves allows for the classiﬁcation of watersheds into a re- 24 duced number of dynamically-similar groups. An illustrative case study for the 25 Narmada River, with 72 sub-watersheds, is presented. 26

In hydrological sciences, machine learning has been used in applications 21 such as precipitation analysis (Sun and Tang, 2020), rainfall-runoff processes 22 (Hsu et al., 1995;Minns and Hall, 1996;Dawson and Wilby, 1998;Abrahart 23 and See, 2000;Duan et al., 2020;Oppel and Mewes, 2020), ground water hy-  In this study, we propose an approach that employs unsupervised classifi-31 cation to group similar basins based on distribution properties of hydrological 32 basins. This provides a means for efficiently organizing a sea of data by sub-33 setting it into a smaller fraction of similar basins based on relevant physical 34 characteristics that can then be further analyzed at a finer detail. We used 35 the width function as a metric since it is a building block of the geomorpho-36 logical instantaneous unit hydrograph concept (Gupta and Waymire, 1983;37 Mesa and Mifflin, 1986;Bras, 1990), along with a hypsometric function to In what follows, we first briefly review some common approaches to similar-5 ity assessment. Next, we discuss the study area and the dataset used. We then 6 discuss the background information about hierarchical clustering, before pre-7 senting our methodology. Next we illustrate the results of the width function-8 and the hypsometric function-based clustering. 9 2 Background 10 2.1 Common approaches to hydrological similarity 11 Comparative hydrology is an approach to the prediction in ungauged basins 12 (PUB) that examines a large number of catchments to distinguish patterns of 13 hydrological behavior using common catchment and climatic characteristics.
14 While there is no universal basis for hydrological classification of catchments 15 (Blöschl et al., 2013), they are self-organizing systems whose hydraulic be-16 havior result from adaptive geomophological processes (Sivapalan, 2006) and 17 there are discernible patterns that form the foundations for understanding 18 their hydrological nature. In general, catchments can be considered hydrologi-19 cally similar if they have similar response to climatic variability (Blöschl et al.,20 2013). Proximity is a commonly used, reliable metric for determining similar 21 catchments, however this measure is limited in that it does not allow for the 22 use of catchments are not closer to each other (Patil and Stieglitz, 2012). Since 23 climate strongly impacts catchment characterisitcs and hydrological behavior, 24 the hydro-climatic region where a catchment is located provides another basis 25 for catchment classification (Budyko et al., 1974;L'vovich, 1979;Abrahams, 26 1984;Milly, 1994;Sankarasubramanian and Vogel, 2002;Woods, 2006;Yadav 27 et al., 2007). Similarly, readily observable spatial patterns in the catchment 28 structure that affect the temporal response can be used as signatures to de-29 termine possible co-evolution of basin dynamics (Blöschl et al., 2013), and can 30 be utilized to transfer hydrological information from data-rich catchments to 31 ungauged basins to predict physical phenomenon such as hydrologic response 32 (Burn and Boorman, 1993;Tung et al., 1997;Aryal et al., 2002;McIntyre et al., 33 2005;Wagener et al., 2007;Reichl et al., 2009;Archfield and Vogel, 2010;Oudin 34 et al., 2010;Patil andStieglitz, 2011, 2012;Razavi and Coulibaly, 2013;Athira 35 et al., 2016;Brunner et al., 2018). The mostly commonly used technique in-36 volves the transfer of lumped characteristics such as catchment shape and size, 37 Strahler ratios, drainage density, average slope, etc. that are used to explain hy-38 drogeomophological characteristics (Horton, 1932(Horton, , 1945Strahler, 1957;Bras, 39 1990;Rodríguez-Iturbe and Rinaldo, 2001). An issue with this is the possibil-40 ity of the loss of information in simplifying complex catchment properties into 41 a single number (Wooldridge and Kalma, 2001;Wagener and Wheater, 2006;4 Bajracharya, Jain Tetzlaff et al., 2009;Chang et al., 2014). Alternatively, distribution curves can 1 be used to assess hydrological similarity. Examples of this include the use of 2 the distribution of topographic index, height above nearest drainage, reduced 3 dissipation per unit length index (Loritz et al., 2019), the distribution of ripar-4 ian and hillslope effects on streams, the riparian-area change along the stream 5 network (McGlynn and Seibert, 2003), the hypsometric curve (Booij et al., 6 2007;Ssegane et al., 2012;Hailegeorgis et al., 2015;Bajracharya and Jain, 7 2021), and the width function (Moussa, 2008;Bajracharya and Jain, 2020). 8 Furthermore, various mathematical models that link catchment structure to 9 hydrological response based on underlying physics or statistical relationships 10 have been used to explore catchment similarity and to develop similarity pa-11 rameters (Hebson and Wood, 1982;Sivapalan et al., 1987;Larsen et al., 1994;12 Milly, 1994;Reggiani et al., 2000;Aryal et al., 2002;Woods, 2003). x represents the total distance along the flow path to the outlet (Veneziano 19 et al., 2000), termed here as the hydrological distance. As we do not distinguish 20 between the hillslope and channel network distance in this study, the width 21 function becomes synonymous with the area function. Under the assumption 22 of constant velocity, the width function represents the probability distribution 23 of travel times or the instantaneous unit hydrograph, reflecting the topological 24 features of a basin's stream response (Lashermes and Foufoula-Georgiou, 2007;25 Moussa, 2008). The width function is strongly linked to the peak and shape 26 of the hydrograph (Kirkby, 1976;Gupta and Waymire, 1983;Troutman and27 Karlinger, 1984, 1989).

28
The width function is most commonly represented by a histogram with 29 the hydrological distance in the x-axis and the frequency or density of the 30 areal extent of streams in the y-axis (Figure 1). Bajracharya and Jain (2020) 31 demonstrated the use of a truncated skew-Normal (SN ) mixture model to 32 analytically represent the width function with the x-axis normalized by scal-33 ing between 0 and 1, and demonstrated its utility in finding hydrologically 34 analogous drainage basins using divergence measures such as the L 2 distance 35 (Tsybakov, 2008). The SN distribution is a three-parameter probability dis-36 tribution formed by adding a skewness element to the Normal distribution.

37
For a continuous random variable, X, the SN distribution is represented as: where φ(x) denotes the standard Normal density function of x, Φ(x) denotes 1 the cumulative distribution function (cdf ) of the standard Normal, and ξ, α, 2 and ω are the location, scale, and shape parameters, respectively. The domain 3 of the SN distribution is then truncated to [0, 1] using a correcting factor to 4 guarantee the validity of the normalization condition (Thomopoulos, 2017): where F (x) denotes the cumulative density function. Finally, a finite mixture 6 model of n truncated SN distributions is represented as: where w i denote the non-negative mixing proportions that sum to one. Fur-8 thermore, the L 2 distance used by Bajracharya and Jain (2020) to measure 9 similarity between two width functions is computed as: where N 1 and N 2 represent the two width functions. A value of zero indicates 11 identical width functions, while larger values reflect a larger difference. The upper x-axis shows the hydrological distance in absolute units (km), while the lower x-axis presents the corresponding scaled hydrological distance.
the geomorphic maturity of catchments, with a concave up shape indicating 1 relatively mature basins with a high degree of erosive activity, and a con-2 cave down shape indicating relatively young basins with a large proportion of 3 uneroded topography or creep-dominated hillslopes (Strahler, 1952;Moglen 4 and Bras, 1995;Pedrera et al., 2009;Willgoose, 2018). Furthermore, studies 5 have linked the hypsometric curve with various drainage basin features such 6 as the hydrograph time-to-peak, head-ward drainage development, regional where β, z, and m denote the three parameters. Furthermore, Bajracharya 5 and Jain (2021) illustrated the use of hypsometry to find analogous basins 6 using the discordance index (DI), defined as the total absolute area between 7 two hypsometric curves. The Narmada River basin (NRB) is located in central India between latitudes 11 21°22' 0" N and 23°46' 30" N, and longitudes 73°4' 0" E and 81°45' 30" E.

12
The drainage area is 95, 000 km 2 ( Figure 3). The elevation ranges from nearly  The elevation data for the region was obtained from GTOPO30, a global 1 digital elevation model (DEM) developed by the United States Geological 2 Survey (USGS). It was derived from several raster and vector sources of to-3 pographic information (USGS, 1996). The dataset has a spatial resolution of 4 30-arc seconds and a vertical accuracy of around 30 m. It is based on several 5 sources of elevation information, including various vector and raster datasets, 6 merged together, with a priority given to the data with a greater topographic 7 detail and accuracy. With extensive accuracy checks, GTOPO30 data are suit-8 able for numerous regional and continental applications, including the extrac-9 tion of drainage features for hydrologic modeling (USGS, 1996). The gap statistic was used to determine the optimal number of clusters 31 (Tibshirani et al., 2001). For a dataset with k clusters based on distance mea-32 sure d, the gap statistic is defined as where E * n represents the expected value for a sample size of n from the reference 2 distribution and W k is the pooled within-cluster sum of squares around the 3 cluster means, defined as W k = k r=1 1 2nr D r . This statistic measures the 4 deviation of the observed W k from its expected value under the null hypothesis.

5
The optimal number of clusters,k, can be chosen based on various algorithms, 6 including global maximum method, which maximizes Gap n (k), signifying the 7 farthest deviation from uniform points distribution. Due to the lack of clear 8 group demarcations in both width function and hypsometric curve shapes, we 9 chosek based on local maxima, where the increase in Gap n (k) first tails off.

10
There is a level of subjectivity in the choice of the number of clusters, with 11 more groups leading to more homogeneity within the group members but a 12 smaller number of members per group. 13 We also demonstrated the process of outlier detection to reduce intra- First, the optimal number of clusters was determined using the gap statistic.  can mistakenly classify small clusters as outliers and remove valuable informa-6 tion from the data. Thus, outlier detection involves a degree of subjectivity.

7
Here we use a simple algorithm to analyze, detect, and remove outliers based 8 on similarity measures with nearest neighbors. Figure 6 shows the L 2 distance 9 to fifteen closest neighbors for each width function. Based on this measure, et al., 2021), we employed this basic outlier detection algorithm as a proof of 1 concept, one that is easy to understand and can be readily applied. Hierarchical clustering can be best denoted using dendrograms. The den-    Fig. 9 Determination of the optimal number of hypsometric function clusters using gap statistic. The optimal number of clusters was chosen based on the change in the rate of increment of the gap statistic. lowering the number of nearest neighbors being considered to just one leads 10 to no member being classified as outliers. This matches visual inspection since 11 the intra-cluster variance in each group is already low. As a result of this, no 12 outlier was removed.

13
Cluster dendograms are shown in Figure 11 (a), along with the mean hypso-14 metric curves for each cluster group (Figure 11   sometric clustering to represent watershed analogs that take into account both, 1 the planar stream network geometry as well as the elevational characteristics 2 of the basin (Figure 12). This provides a framework for bivariate clustering 3 that incorporates multiple metrics that supplement each other. For instance, 4 sub-basins 14, 21, 30, 33, 35, and 60 fall in hypsometric cluster 4 and width 5 function cluster 5, with these members indicating mildly mature hypsome-6 try and width functions with the peak considerably skewed to the right. As 7 such, these sub-basins could potentially be analogues with similar hydrologi-8 cal response properties. Sub-basins 17 and 53 have concave-down hypsometric 9 curves (hypsometric cluster 7), but have considerably different width function 10 shapes (width function clusters 2 and 5), indicating that the hydrological re-11 sponse behaviour of these two sub-basins might be considerably different. As which results in a fuller description of basin processes. In Figure 12, we explore 1 the spatial relationships between members in the bivariate groups. Group 5-2, 2 with an early width function peak and a relatively linear hypsometric curve, 3 is predominantly formed at the upstream region of the watershed. Group 4-5, 4 with a highly steep falling limb of the width function and a relatively linear 5 hypsometric curve, exhibited relatively smaller accumulation areas. However, 6 in general, the spatial relationship within the highlighted bivariate groups was 7 found to be weak. for the partitioning of watersheds into groups with consistent functional forms. 16 We proposed a four-step approach for forming hydrologically similar analogues.       Figure S1 shows the width function clusters before the removal of outliers.

5
Clusters 1, 5, and 6 have higher peaks in the right SN component while 6 cluster 3 has a higher peak in the left SN component, potentially indicative of 7 different location of peak flows in hydrographs. Furthermore, the high slopes 8 on right sides of the curves for clusters 2 and 6 could be indicative of more 9 rapidly falling recession limbs of hydrographs.