Importance of regional rainfall data in homogeneous clustering of data-sparse areas: a study in the upper Brahmaputra valley region

Delineation of homogeneous regions has found its way into many hydrological applications, as it helps in addressing the challenges of understanding the behavior of rainfall distribution and its variability at a local scale. In the present study, rainfall data recorded by 83 tea gardens in the upper Brahmaputra valley region of Assam have been used to identify homogeneous rainfall regions by using fuzzy clustering analysis. Furthermore, seven different cluster validity indices (CVs) were used to find the optimum clustering in the fuzzy c-means (FCM) algorithm. The clusters thus formed were assessed for statistical homogeneity by performing homogeneity tests based on L-moments. Three different combinations of feature vectors were employed in the FCM algorithm, and the outputs were compared to obtain the best solution to regionalization. The results were further compared with previous regionalization studies. The analysis and comparison indicate that if regionalization needs to be done at a local scale, further sub-clustering of a larger clustered region into smaller regions may be required. Local rainfall data can be used for this purpose, provided a good dataset with a large number of stations is available within the region. Along with rainfall data, geographical location parameters (latitude, longitude, and elevation) need to be taken into account to reach a definite conclusion.


Introduction
Rainfall is one of the most important hydrological parameters and needs to be studied carefully on both spatial and temporal scales. However, understanding the behavior of the rainfall distribution pattern becomes challenging at a regional scale. The pattern of rainfall, its frequency, and its magnitude may vary drastically depending upon the orography (Venkatesh and Jose 2007) and the large-scale synoptic and convective precipitation types (Karaca et al. 2000; Unal et al. 2012; Baltacı et al. 2015; Efe et al. 2019) of the region. Scarcity of data at many sites of interest may further complicate the investigation. To address this issue, the region may be classified into a few homogeneous rainfall regions of similar rainfall distribution, a process termed regionalization. The Brahmaputra valley region is situated in the northeastern part of India and has its own peculiar topography. The variations in orographic arrangement and altitude across the region give rise to irregular and complex rainfall patterns at a local scale, which amplifies the need for regionalization.
Regionalization has found its way into many applications in water resources planning, agricultural planning, drainage design, and estimation of the magnitude and frequency of extreme events such as floods and droughts. The literature indicates that, in the past, political and geographical boundaries were used as the basis for forming homogeneous regions (Thomas and Benson 1970; NERC 1975; Beable and McKercher 1982; Heiler and Hong 1987). However, such boundaries have been found unconvincing as a basis for forming hydrologically homogeneous regions (Bonell and Sumner 1992; Burn 1997; Rao and Srinivas 2006b; Satyanarayana and Srinivas 2011; Dikbas et al. 2012). A considerable amount of research has been carried out to identify homogeneous rainfall regions with methods other than geographical divisions (Bedi and Bindra 1980; Bärring 1987; Sumner and Bonell 1988; Kulkarni et al. 1992; Gadgil et al. 1993; Burn 1997; Adelekan 1998). Regionalization with principal component analysis (PCA) was found to be useful (Singh and Singh 1996; Wotling et al. 2000). When the subjectivity involved in PCA was noticed, cluster analysis started getting attention (Bonell and Sumner 1992; Guttman 1993; Venkatesh and Jose 2007; Machiwal et al. 2019). Cluster analysis refers to a varied group of statistical procedures used to classify a multivariate dataset into clusters or groups (Rao and Srinivas 2006a, 2006b; Srinivas et al. 2008; Dikbas et al. 2012). Studies have also combined PCA with various cluster analysis techniques for homogeneous clustering (Dinpashoh et al. 2004; Satyanarayana and Srinivas 2011; Darand and Daneshvar 2014; Machiwal et al. 2019). Ward's hierarchical cluster analysis is one of the widely used methods found to be suitable for homogeneous regionalization (Unal et al. 2003; Baltacı et al. 2017). Another widely applied method is K-means clustering (Rao and Srinivas 2006a; Pelczer et al. 2007; Agarwal et al. 2016). K-means clustering splits a region into hard clusters, i.e., with a degree of membership of 1 or 0. This means that a site belongs to exactly one cluster. However, this may not be valid in real-world cases. To address this matter, the concept of fuzzy clustering was introduced to regionalization, which allows a site to belong to several clusters concurrently, each with a certain membership value. The fuzzy membership value of a site signifies the extent to which it fits into a particular group of sites (Rao and Srinivas 2006b). Many studies have successfully applied fuzzy clustering to the delineation of hydrologically homogeneous regions in recent years (Hall and Minns 1999; Owen et al. 2006; Plain et al. 2008; Sadri and Burn 2011; Satyanarayana and Srinivas 2011; Chen et al. 2011; Dikbas et al. 2012; Mok et al. 2012; Chavoshi et al. 2013; Asong et al. 2015; Bharath and Srinivas 2015; Goyal and Sharma 2016; Irwin et al. 2017; Wang et al. 2017).
On the basis of critical reviews of earlier studies on regionalization, the aim of the current study is defined as to identify homogeneous rainfall regions in upper Brahmaputra valley region of northeast India by using fuzzy clustering analysis. To achieve the best possible partition from the fuzzy c-means (FCM) algorithm, seven different cluster validity indices (CVs) were used. Three different combinations of feature vectors were employed in FCM algorithm and the outputs were compared for attaining best solutions to regionalization. The homogeneous rainfall regions (fuzzy clusters) thus formed by the use of FCM algorithm and validated with CVs were then assessed for statistical homogeneity by performing homogeneity tests using L-moment approach (Hosking and Wallis 1997). The results were further compared with some previous regionalization studies and finally concluding remarks were presented.

Methodology
In the subsections below, the fuzzy c-means (FCM) algorithm for delineation of homogeneous clusters is described first, followed by a brief description of the CVs, the homogeneity test methods, and the process of adjusting heterogeneous clusters. The steps undertaken to assure data quality are explained thereafter. The methodology is shown as a flowchart in Fig. 1.

Fuzzy C-means clustering
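The FCM approach is essentially the optimization of the fuzzy c-means objective function. It was initially developed by Dunn (1973) and afterwards modified by Bezdek et al. (1984). The fuzzy c-means function to be minimized is expressed as:

$$J(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{N} (\mu_{ki})^{m} \, d^{2}(x_i, v_k)$$

where $U = [\mu_{ki}]$ is the fuzzy partition matrix, with $\mu_{ki}$ the membership of the feature vector $x_i$ of site $i$ in cluster $k$; $V = [v_1, \ldots, v_c]$, with $v_k \in \mathbb{R}^{n}$, is the vector of cluster centers to be determined; $d^{2}(x_i, v_k) = \lVert x_i - v_k \rVert^{2}$ is a squared distance norm; and $m \in [1, \infty)$ is the fuzziness parameter or fuzzifier. Usually, its value falls between 1 and 2.5 (Pal and Bezdek 1995).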

FCM algorithm for delineation of homogeneous rainfall regions
(i) The initial fuzzy partition matrix U is set.
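(ii) Then, the initial membership value $\mu_{ki}^{\text{init}}$ of $x_i$ in cluster $k$ is adjusted so that the memberships of each site sum to one:
$$\mu_{ki} = \frac{\mu_{ki}^{\text{init}}}{\sum_{j=1}^{c} \mu_{ji}^{\text{init}}}$$
(iii) The fuzzy cluster centroid $v_k$ is then calculated as
$$v_k = \frac{\sum_{i=1}^{N} (\mu_{ki})^{m} x_i}{\sum_{i=1}^{N} (\mu_{ki})^{m}}$$
(iv) The fuzzy membership value $\mu_{ki}$ is updated as
$$\mu_{ki} = \frac{\left[ 1 / d^{2}(x_i, v_k) \right]^{1/(m-1)}}{\sum_{j=1}^{c} \left[ 1 / d^{2}(x_i, v_j) \right]^{1/(m-1)}}$$
(v) The objective function is then calculated as
$$J(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{N} (\mu_{ki})^{m} \, d^{2}(x_i, v_k)$$
Steps (iii) to (v) are repeated until the difference in the objective function between two consecutive iterations becomes adequately small.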
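The iteration described in steps (i) to (v) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the study: the function name `fuzzy_c_means`, the squared Euclidean distance, the random initialization, the tolerance, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np


def fuzzy_c_means(X, c, m=1.5, tol=1e-6, max_iter=300, seed=0):
    """Minimal FCM sketch. X is an (N, n) feature matrix.

    Returns the (c, N) membership matrix U, the (c, n) centroids V,
    and the list of objective function values, one per iteration.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Steps (i)-(ii): random initial memberships, normalised so that the
    # memberships of every site sum to 1 over the c clusters.
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)
    history = []
    for _ in range(max_iter):
        # Step (iii): centroids as membership-weighted means of the sites.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Squared Euclidean distance d2[k, i] between centroid k and site i.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Step (iv): membership update.
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # Step (v): objective function for the updated memberships.
        history.append(float(((U ** m) * d2).sum()))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return U, V, history


# Synthetic demo (hypothetical data): two well-separated groups of 20 sites.
rng_demo = np.random.default_rng(1)
X_demo = np.vstack([rng_demo.normal(0.0, 0.3, (20, 2)),
                    rng_demo.normal(5.0, 0.3, (20, 2))])
U_demo, V_demo, J_demo = fuzzy_c_means(X_demo, c=2, m=1.5)
```

Each column of `U_demo` holds the memberships of one site across the clusters, so the columns sum to one, and `J_demo` should be non-increasing from one iteration to the next.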

Cluster validity indices
The FCM algorithm divides the data into well-separated and compact clusters, provided that optimal values of c and m are used. Hence, deciding the optimal values of these parameters is crucial. Bezdek (1981) addressed this matter by introducing the concept of validity indices. These indices essentially measure the goodness of the partitioned clusters. In hydrological studies, several indices are used (Hall and Minns 1999; Srinivas et al. 2008). In the case of the FCM algorithm, the following CVs are found to perform well:
(1) Fuzzy partition coefficient ($V_{PC}$)
$$V_{PC} = \frac{1}{N} \sum_{k=1}^{c} \sum_{i=1}^{N} \mu_{ki}^{2}$$
(2) Fuzzy partition entropy ($V_{PE}$)
$$V_{PE} = -\frac{1}{N} \sum_{k=1}^{c} \sum_{i=1}^{N} \mu_{ki} \log(\mu_{ki})$$
(3) Fuzziness performance index ($V_{FPI}$)
$$V_{FPI} = 1 - \frac{c \, V_{PC} - 1}{c - 1}$$
(4) Normalized classification entropy ($V_{NCE}$)
$$V_{NCE} = \frac{V_{PE}}{\log(c)}$$
Bezdek (1974a, 1974b) formulated $V_{PC}$ and $V_{PE}$, whereas $V_{FPI}$ and $V_{NCE}$ were proposed by Roubens (1982). The range of $V_{PC}$ is [1/c, 1]; $V_{PC} = 1/c$ indicates equal sharing of clusters, i.e., equal membership values of a data point in all clusters ($\mu_{ki} = 1/c \;\forall\, i, k$), and $V_{PC} = 1$ indicates no sharing of membership among the clusters. Similarly, the range of $V_{PE}$ is [0, log(c)]. $V_{PE} = 0$ implies no sharing of membership among clusters and $V_{PE} = \log(c)$ implies equal sharing of clusters ($\mu_{ki} = 1/c \;\forall\, i, k$). For $V_{FPI}$ and $V_{NCE}$, in contrast, the range is [0, 1]; 0 implies no membership sharing between clusters and 1 implies equal sharing of clusters ($\mu_{ki} = 1/c \;\forall\, i, k$). As such, a maximum value of $V_{PC}$ (equivalently, a minimum value of $V_{PE}$, $V_{FPI}$, and $V_{NCE}$) indicates the optimum partition, that is, the least overlap among clusters. Earlier studies have stated that these four CVs tend to display a monotonic increasing or decreasing trend (Rao and Srinivas 2006b; Srinivas et al. 2008). Hence, they are not very effective in obtaining the optimum partition to delineate rainfall regions. Xie and Beni (1991) found no direct correlation of $V_{PC}$ and $V_{PE}$ with any property of the data. Furthermore, they are found to be very sensitive to the fuzzifier value, m (Halkidi et al. 2001). Here, these indices are used mainly to check their performance in detecting the optimum cluster number. To overcome these drawbacks, three further validity indices are used, as explained below:
(5) Extended Xie and Beni index ($V_{XB}$)
Xie and Beni (1991) proposed the cluster validity index $V_{XB,m}$. It quantifies the ratio of compactness within the fuzzy clusters to the separation between clusters:
$$V_{XB,m} = \frac{\sum_{k=1}^{c} \sum_{i=1}^{N} (\mu_{ki})^{m} \lVert x_i - v_k \rVert^{2}}{N \, \min_{k \neq j} \lVert v_k - v_j \rVert^{2}}$$
The optimal partition of clusters should exhibit the minimum value of $V_{XB,m}$.
(6) Fukuyama and Sugeno index ($V_{FS}$)
This index was proposed by Fukuyama and Sugeno (1989); the optimum partition is indicated by a minimum value of $V_{FS}$.
(7) Kwon index ($V_{K}$)
The extended Xie and Beni index ($V_{XB}$) exhibits a monotonic decreasing tendency as c→N. To address this problem, Kwon (1998) proposed another index, $V_{K}$, that has an ad hoc punishing function in the numerator.
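To make the behaviour of these indices concrete, the sketch below evaluates five of them for a given partition. It is an illustration under stated assumptions rather than the code used in the study: the function name `validity_indices` is hypothetical, natural logarithms and squared Euclidean distances are assumed, and the Kwon index is written in its commonly quoted form (squared memberships, with the mean squared distance of the centroids from the data mean as the penalty term). The Fukuyama and Sugeno index is left out.

```python
import numpy as np


def validity_indices(X, U, V, m):
    """X: (N, n) data, U: (c, N) memberships, V: (c, n) centroids."""
    c, N = U.shape
    eps = 1e-12  # avoids log(0) for crisp memberships
    v_pc = float((U ** 2).sum() / N)
    v_pe = float(-(U * np.log(U + eps)).sum() / N)
    v_fpi = 1.0 - (c * v_pc - 1.0) / (c - 1.0)
    v_nce = v_pe / np.log(c)
    # Squared distances: sites to centroids, and centroids to each other.
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)    # (c, N)
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (c, c)
    min_sep = sep[~np.eye(c, dtype=bool)].min()
    v_xb = float(((U ** m) * d2).sum() / (N * min_sep))
    # Kwon index (assumed form): penalty = mean squared distance of the
    # centroids from the mean of all feature vectors.
    penalty = ((V - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    v_k = float((((U ** 2) * d2).sum() + penalty) / min_sep)
    return {"PC": v_pc, "PE": v_pe, "FPI": v_fpi, "NCE": v_nce,
            "XB": v_xb, "K": v_k}


# Hypothetical one-dimensional example with four sites and two clusters.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
V_toy = np.array([[0.5], [10.5]])
U_crisp = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
U_equal = np.full((2, 4), 0.5)
cv_crisp = validity_indices(X_toy, U_crisp, V_toy, m=2.0)
cv_equal = validity_indices(X_toy, U_equal, V_toy, m=2.0)
```

For the crisp partition the first four indices sit at the "no sharing" end of their ranges (PC = 1, the others about 0), and for the equal-membership partition they sit at the opposite end (PC = 1/c, PE = log c, FPI = NCE = 1), which matches the ranges stated above.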
To determine the optimal values of c and m, a range of values for the two parameters are selected and subsequent partitioning results show different sets of clusters, along with their validity indices. To decide the optimal set of values for c and m among those sets, first the optimum selection criteria of each of the validity indices are examined. Then, the sites having greater membership value in the clusters are identified, based on a threshold value of fuzzy membership (T i ). Thus, a fuzzy cluster is made by allocating those sites to the cluster, whose membership values are found to be higher than or equal to the threshold fuzzy membership value (T i ). The selection of this threshold value is subjective (Satyanarayana and Srinivas 2011). The most reasonable explanation would be to allocate the site to that group where its membership value is the highest. Yet, uncertainty arises when a site has low membership value in all the clusters or has equal memberships. To address this issue, homogeneity test is done which is followed by adjustment of heterogeneous clusters. The methodologies for both of these are explained in the following subsection.
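The two allocation rules discussed above (a membership threshold versus the highest membership) can be illustrated as follows. The helper names and the small membership matrix are hypothetical and serve only to show that, with a threshold, a site may fall into more than one cluster or into none.

```python
import numpy as np


def assign_by_threshold(U, threshold):
    """For each cluster k, return the indices of sites with membership >= threshold."""
    return [np.flatnonzero(U[k] >= threshold).tolist() for k in range(U.shape[0])]


def assign_by_max(U):
    """Allocate every site to the cluster in which its membership is highest."""
    return U.argmax(axis=0).tolist()


# Hypothetical memberships of three sites (columns) in two clusters (rows).
U_example = np.array([[0.7, 0.5, 0.2],
                      [0.3, 0.5, 0.8]])
by_threshold = assign_by_threshold(U_example, threshold=0.5)
by_max = assign_by_max(U_example)
```

Here the second site has equal memberships and is placed in both clusters by the threshold rule, whereas the highest-membership rule breaks the tie arbitrarily in favour of the first cluster. Ties of this kind are what the homogeneity test and adjustment step are meant to resolve.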

Homogeneity test and adjustment of heterogeneous clusters
The fuzzy clusters thus formed by using FCM algorithm and validated with CVs are then required to be assessed for statistical homogeneity by performing homogeneity tests. Heterogeneity measure (L-moment based) proposed by Hosking and Wallis performs better when skewness is low (average L-skew < 0.23) for a sample set of data, while for higher skewness, bootstrap Anderson-Darling test is recommended (Viglione et al. 2007). Previous studies have shown that Hosking and Wallis's homogeneity test is appropriate for delineation of homogeneous rainfall regions (Satyanarayana and Srinivas 2011), hence is considered in this study.

L-moment of data samples
The L-moment approach is a method of describing the shape of a probability distribution and estimating its parameters, especially for small samples of environmental data, since it is unbiased and has a nearly normal distribution (Hosking 1990). Like conventional moments, L-moments characterize the location, dispersion, skewness, peakedness, and other features of the shape of a probability distribution. However, L-moments are derived from linear combinations of the ordered data (Hosking 1990). These statistics are established by modifying "probability-weighted moments" (Greenwood et al. 1979). For a sample of size n arranged in ascending order, $x_{1:n} \le x_{2:n} \le \cdots \le x_{n:n}$, the sample probability-weighted moments are given below:

$$b_r = \frac{1}{n} \sum_{j=r+1}^{n} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)} \, x_{j:n}, \qquad r = 0, 1, 2, \ldots$$

The first few L-moments and L-moment ratios are defined as:

$$l_1 = b_0 \ \text{(location)}, \quad l_2 = 2b_1 - b_0 \ \text{(scale)}, \quad l_3 = 6b_2 - 6b_1 + b_0, \quad l_4 = 20b_3 - 30b_2 + 12b_1 - b_0$$

$$t = l_2 / l_1 \ \text{(L-Cv)}, \quad t_3 = l_3 / l_2 \ \text{(L-skewness)}, \quad t_4 = l_4 / l_2 \ \text{(L-kurtosis)}$$

Discordancy measures, heterogeneity measures, and adjustment of heterogeneous clusters

Discordancy measures
The discordancy measure ($D_i$) detects those sites which are unacceptably discordant with the designated cluster (Hosking and Wallis 1993). The discordancy value for the i-th site (Hosking and Wallis 1995) is given as

$$D_i = \frac{1}{3} (u_i - \bar{u})^{T} S^{-1} (u_i - \bar{u})$$

Here, S is the sample covariance matrix, expressed as

$$S = \frac{1}{N - 1} \sum_{i=1}^{N} (u_i - \bar{u})(u_i - \bar{u})^{T}$$

where $u_i = [t^{(i)}, t_3^{(i)}, t_4^{(i)}]^{T}$ is a vector comprising the values of t, $t_3$, and $t_4$ for the i-th site, N is the number of sites, and

$$\bar{u} = \frac{1}{N} \sum_{i=1}^{N} u_i$$

Large values of $D_i$ indicate probable errors in the site data. Hosking and Wallis (1993) explained that a site is not considered homogeneous with the region if $D_i$ exceeds a certain critical value. $D_i \ge 3$ is suggested as the criterion for declaring a site discordant, for regions with 15 or more sites. However, Hosking and Wallis (1993, 1995) advised scrutinizing the data of the sites with the largest $D_i$ values, irrespective of their magnitude.
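As a sketch of how the sample L-moment ratios and the discordancy measure might be computed, the block below uses the unbiased sample probability-weighted moments $b_0$ to $b_3$ and a covariance matrix normalised by N − 1. The function names are hypothetical, the normalisation of S is an assumption, and the random L-moment ratios in the demo are placeholders rather than data from the study.

```python
from math import comb

import numpy as np


def sample_lmoment_ratios(x):
    """Return (l1, t, t3, t4) for a sample with at least four values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    b = []
    for r in range(4):
        # Weight (j-1)...(j-r) / ((n-1)...(n-r)) equals C(j-1, r) / C(n-1, r).
        w = np.array([comb(j - 1, r) for j in range(1, n + 1)]) / comb(n - 1, r)
        b.append(float((w * x).mean()))
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2 / l1, l3 / l2, l4 / l2


def discordancy(ratios):
    """ratios: (N, 3) array with rows [t, t3, t4]; returns D_i for each site."""
    u = np.asarray(ratios, dtype=float)
    n_sites = u.shape[0]
    dev = u - u.mean(axis=0)
    S = dev.T @ dev / (n_sites - 1)  # sample covariance (assumed N - 1)
    S_inv = np.linalg.inv(S)
    return np.einsum("ij,jk,ik->i", dev, S_inv, dev) / 3.0


# Equally spaced sample: symmetric, so L-skewness is zero.
l1_demo, t_demo, t3_demo, t4_demo = sample_lmoment_ratios([1, 2, 3, 4, 5])

# Placeholder L-moment ratios for 10 hypothetical sites.
rng_lm = np.random.default_rng(0)
D_demo = discordancy(rng_lm.normal([0.2, 0.1, 0.15], 0.03, size=(10, 3)))
```

With this normalisation the $D_i$ values of a region always sum to N − 1, which gives a quick sanity check on an implementation.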
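Heterogeneity measures
Heterogeneity measures give the degree of heterogeneity existing within a region. They are estimated by comparing the actual variability in the L-moment ratios with the variability expected in a homogeneous region. The heterogeneity measures to be estimated are $H_1$, $H_2$, and $H_3$, which are defined based on L-Cv, L-skewness, and L-kurtosis, as given below.
i. Heterogeneity measure based on L-Cv:
$$H_1 = \frac{V - \mu_V}{\sigma_V}$$
ii. Heterogeneity measure based on L-Cv and L-skewness:
$$H_2 = \frac{V_2 - \mu_{V_2}}{\sigma_{V_2}}$$
iii. Heterogeneity measure based on L-skewness and L-kurtosis:
$$H_3 = \frac{V_3 - \mu_{V_3}}{\sigma_{V_3}}$$
Here, V is the weighted standard deviation (dispersion) of the sample coefficients of L-variation (L-Cv); $V_2$ and $V_3$ denote the weighted average distance from the sites to the group weighted mean in the two-dimensional space of L-Cv/L-skewness and L-skewness/L-kurtosis, respectively; $\mu_V$, $\mu_{V_2}$, and $\mu_{V_3}$ are the means of the V, $V_2$, and $V_3$ values, respectively, calculated from a large number of simulations ($N_{sim}$); and $\sigma_V$, $\sigma_{V_2}$, and $\sigma_{V_3}$ are the corresponding standard deviations calculated from the same simulations. In this study, $N_{sim}$ is taken as 1000, with the simulated regions generated from a kappa distribution fitted to the regional average L-moment ratios. The dispersion measures are

$$V = \left[ \frac{\sum_{i=1}^{N} n_i \left( t^{(i)} - t^{R} \right)^{2}}{\sum_{i=1}^{N} n_i} \right]^{1/2}$$

$$V_2 = \frac{\sum_{i=1}^{N} n_i \left[ \left( t^{(i)} - t^{R} \right)^{2} + \left( t_3^{(i)} - t_3^{R} \right)^{2} \right]^{1/2}}{\sum_{i=1}^{N} n_i}$$

$$V_3 = \frac{\sum_{i=1}^{N} n_i \left[ \left( t_3^{(i)} - t_3^{R} \right)^{2} + \left( t_4^{(i)} - t_4^{R} \right)^{2} \right]^{1/2}}{\sum_{i=1}^{N} n_i}$$

Here, N indicates the number of sites/stations in the clustered region; $n_i$ indicates the record length of site i; $t^{(i)}$, $t_3^{(i)}$, and $t_4^{(i)}$ are the L-moment ratios of site i; and $t^{R}$, $t_3^{R}$, and $t_4^{R}$ indicate the regional average L-moment ratios (L-Cv, L-skewness, and L-kurtosis, respectively).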
If H<1 for a region, it is described as "acceptably homogeneous," 1≤H<2 implies "possibly heterogeneous," and H≥2 implies "definitely heterogeneous." For further details on heterogeneity measures, Hosking and Wallis (1997) can be referred.
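The bookkeeping around the heterogeneity measure can be sketched as below. The kappa-distribution simulation itself is not implemented here: the function `heterogeneity` simply takes an array of simulated dispersion values as input. All names are hypothetical, and the regional averages are assumed to be weighted by record length.

```python
import numpy as np


def weighted_dispersions(n, t, t3, t4):
    """Observed V, V2, V3 from record lengths n and at-site L-moment ratios."""
    n, t, t3, t4 = (np.asarray(a, dtype=float) for a in (n, t, t3, t4))
    w = n / n.sum()
    # Regional averages, weighted by record length (assumption).
    tR, t3R, t4R = (w * t).sum(), (w * t3).sum(), (w * t4).sum()
    V = float(np.sqrt((w * (t - tR) ** 2).sum()))
    V2 = float((w * np.hypot(t - tR, t3 - t3R)).sum())
    V3 = float((w * np.hypot(t3 - t3R, t4 - t4R)).sum())
    return V, V2, V3


def heterogeneity(v_observed, v_simulated):
    """H = (V - mean of simulated V) / (standard deviation of simulated V)."""
    v_simulated = np.asarray(v_simulated, dtype=float)
    return float((v_observed - v_simulated.mean()) / v_simulated.std(ddof=1))


def classify(H):
    """Apply the H < 1, 1 <= H < 2, H >= 2 thresholds."""
    if H < 1:
        return "acceptably homogeneous"
    if H < 2:
        return "possibly heterogeneous"
    return "definitely heterogeneous"


# Hypothetical region of three identical sites: no dispersion at all.
V_same = weighted_dispersions([30, 40, 50], [0.2] * 3, [0.1] * 3, [0.15] * 3)
# Toy numbers only: observed V = 3 against simulated values 1, 2, 3.
H_toy = heterogeneity(3.0, [1.0, 2.0, 3.0])
```

In practice `v_simulated` would hold the $N_{sim}$ values of V (or $V_2$, $V_3$) computed from the simulated homogeneous regions.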
Adjustment of heterogeneous clusters
Clusters formed by clustering algorithms do not always exhibit statistical homogeneity. After the homogeneity test has been performed, the possibly or definitely heterogeneous clusters still need to be adjusted before a definite conclusion can be reached. Furthermore, the test assumes that there is no cross-correlation among the data. In actual scenarios, however, rainfall varies gradually across space, which implies that cross-correlation exists among geographically contiguous sites (Satyanarayana and Srinivas 2011). So, further adjustments are essential to form homogeneous regions. Studies suggest that, to decrease the heterogeneity measure values, the discordant sites of a region can be either removed or shifted to some other region, after confirming that the site has not exhibited a high fuzzy membership value in that cluster and that its discordancy is not due to sampling variability (Rao and Srinivas 2006b; Satyanarayana and Srinivas 2011). Heterogeneous regions can also be split into two or more regions and, if required, two or more small regions can be merged.

Data quality assurance and homogenization
Missing and non-homogeneous data (presence of outliers) is an unavoidable condition in any hydrometeorological dataset. This issue occurs because of many reasons such as instrument malfunction, communication breakdown, observational errors, and fire. In developing countries, the issue is more due to scarcity of meteorological stations (Woldesenbet et al. 2017). To deal with the non-homogenous dataset, gap filling and outlier removal techniques are applied. Many techniques can be found in previous studies on gap filling and removal of outliers. In this study, gap filling is done, where needed, with the help of simple arithmetic mean as the presence of missing data is quite less in the collected dataset (explained in detail in Section 3.2). For detection and removal of outlier, z-score method is found to be one of the most applied methods in hydrometeorological studies.
This method computes a z-value to detect the presence of outliers, as shown below:

$$z_i = \frac{x_i - \bar{x}}{S_x}$$

Here, $\bar{x}$ is the mean and $S_x$ is the standard deviation of the dataset. A data point is considered an outlier when $|z_i|$ exceeds a threshold value, usually 2.5. A limitation of this method is that an outlier affects both the mean and the standard deviation, which pulls the z-scores down, so that few or no outliers may be detected. Therefore, a modified z-score was proposed by Iglewicz and Hoaglin (1993):

$$M_i = \frac{0.6745 \, (x_i - \tilde{x})}{\text{MAD}}$$

where $\tilde{x}$ is the median of the dataset and $\text{MAD} = \text{median}(|x_i - \tilde{x}|)$ is the median of the absolute deviations about the median.
The threshold value for the modified z-score is 3.5; i.e., if $|M_i|$ is greater than 3.5, the data point is considered a potential outlier.
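The masking effect described above can be seen with a small made-up series. The sketch below assumes the sample standard deviation (divisor n − 1) for the ordinary z-score; the function names and the numbers are illustrative only.

```python
import numpy as np


def z_scores(x):
    """Ordinary z-scores using the mean and sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)


def modified_z_scores(x):
    """Modified z-scores based on the median and the MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad


# Made-up series with one suspicious value at the end.
series = [10, 12, 11, 13, 12, 11, 50]
z_plain = z_scores(series)
z_mod = modified_z_scores(series)
flag_plain = np.flatnonzero(np.abs(z_plain) > 2.5).tolist()
flag_mod = np.flatnonzero(np.abs(z_mod) > 3.5).tolist()
```

In this example the value 50 inflates the mean and standard deviation enough that its ordinary z-score stays below 2.5, while its modified z-score is far above 3.5, so only the modified score flags it.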

Results and discussion
Geographic and climate properties of the study area
Figure 2 shows the study area, which is situated in the upstream part of the Brahmaputra River in the state of Assam and covers the central and eastern Brahmaputra valley regions. It covers most of the upper and middle Assam districts. This region of the valley is surrounded by the Eastern Himalayas towards the north, the Patkai Bum in the east, the Naga Hills in the south, and the Meghalaya plateau in the far south. The Brahmaputra valley has great geographical as well as political significance. It is surrounded by Bhutan in the north, Arunachal Pradesh in the north and east, and Nagaland and the Karbi Anglong hills in the south. The study area lies between 25.921°N and 27.619°N latitudes and 91.896°E and 95.768°E longitudes. Mean annual precipitation varies from 859 to 3412 mm and the elevation varies from 67 to 427 m. Seasonal precipitation is highest in June-July-August (JJA) (406-1880 mm) and lowest in December-January-February (DJF) (9-222 mm). The Brahmaputra valley has a subtropical climate which is influenced by the northeast and southwest monsoons. The Meghalaya plateau, the Himalayas, and the surrounding hills of Arunachal Pradesh, Manipur, Nagaland, and Mizoram influence the climate. The monsoon winds coming from the Bay of Bengal move towards the northeast and hit these mountains, causing heavy precipitation over the valley.

Data used and its quality assurance
A total of 83 rain gauge stations with observation periods of varying length (from 5 years of daily data to as long as 84 years, up to the year 2016) were selected from various tea gardens of Assam. The location (longitude and latitude), seasonal precipitation (in mm), mean annual precipitation (in mm), and elevation (in m) of the rain gauge stations are shown in Table 1. The elevation of each station was determined from the Shuttle Radar Topography Mission digital elevation model (DEM) data, version 2.1 (SRTM-1), of 30 m resolution. The DEM is distributed as tiles in zipped SRTM HGT files at 1-arcsecond resolution (3601 × 3601 pixels) in a latitude/longitude projection (EPSG:4326).
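For readers who want to extract station elevations from such tiles, a possible approach is sketched below. It relies on the usual layout of SRTM HGT files (big-endian signed 16-bit integers stored row by row from north to south, with the tile named after its south-west corner) and uses a nearest-pixel lookup. The function names, the nearest-pixel choice, and the tile name in the usage note are assumptions for illustration, not the procedure followed in the study.

```python
import numpy as np


def read_hgt(path, size=3601):
    """Read an unzipped SRTM .hgt tile into a (size, size) integer array."""
    return np.fromfile(path, dtype=">i2").reshape(size, size)


def elevation_at(tile, lat, lon, lat_sw, lon_sw):
    """Nearest-pixel elevation for a point inside a 1-degree tile.

    lat_sw and lon_sw are the integer coordinates of the south-west corner
    of the tile. Row 0 of the array is the northern edge of the tile.
    """
    size = tile.shape[0]
    row = int(round((lat_sw + 1 - lat) * (size - 1)))
    col = int(round((lon - lon_sw) * (size - 1)))
    return int(tile[row, col])


# Tiny synthetic 3 x 3 "tile" to show the indexing convention.
tile_demo = np.arange(9).reshape(3, 3)
sw_value = elevation_at(tile_demo, lat=26.0, lon=94.0, lat_sw=26, lon_sw=94)
ne_value = elevation_at(tile_demo, lat=27.0, lon=95.0, lat_sw=26, lon_sw=94)
```

With a real tile, a call such as `elevation_at(read_hgt("N26E094.hgt"), 26.75, 94.2, 26, 94)` would return the elevation in metres at that location; the file name here follows the standard SRTM naming pattern and is used only as an example.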
Before proceeding to the homogeneous clustering, the data collected from the tea gardens were thoroughly reviewed to ensure data quality. The tea gardens have been collecting these data for a long time; some stations even have data prior to independence. The quality of the data is reliable, as they were specifically and meticulously collected for the maintenance of the tea gardens. The data are recorded at 24-h intervals for almost all days of the recorded period. As such, few gaps have been noticed in the collected data series. Since annual data series of the stations are used in this study, the daily datasets were first converted to monthly data and then to annual data. Therefore, gap filling was found to be of little use in the present study. In the unavoidable cases, the gap was filled with the simple arithmetic mean value. Furthermore, it was ensured that only the period of each dataset with continuous monthly data was used.
However, removal of outliers is important before proceeding with any hydrometeorological study (González-Rouco et al. 2001; Cho et al. 2013; Woldesenbet et al. 2017). Here, the modified z-score method (Iglewicz and Hoaglin 1993), as explained in Section 2.4, has been applied to detect outliers in the dataset. For most of the stations, the modified z-score values were found to be within the acceptable range (±3.5). Data values with a modified z-score outside the acceptable range were either replaced by the mean value of the series (for smaller deviations from the acceptable range) or removed (for larger deviations). The corrected datasets were then used for homogeneous clustering.

Cluster analysis with fuzzy C-means (FCM) algorithm
Previous studies suggest that the attributes to be considered in clustering of homogeneous rainfall regions may include large-scale atmospheric variables (LSAVs) of global climate models (GCMs) or else principal components (PCs) of GCMs, location parameters (latitude, longitude, elevation, etc.), and seasonality measures (maximum, minimum, standard deviation of rainfall, etc.). In the present study, total annual rainfall, total monthly rainfall, standard deviation of total annual rainfall and total monthly rainfall, and all three location parameters, i.e., latitude, longitude, and elevation of all the 83 rain gauge stations, were included as attributes. Three different combinations of feature vectors were used to form three different input data matrices for the FCM algorithm. This was done in order to observe the effect of a particular attribute on the cluster formation. The three combinations were as follows:
Case 1: Input data matrix with total monthly rainfall as attributes.
Case 2: Input data matrix with standard deviation of total monthly rainfall as attributes.
Case 3: Input data matrix with latitude, longitude, elevation, total annual rainfall, and standard deviation of total annual rainfall as attributes.
To acquire reliable results from clustering, the most crucial point is the choice of the cluster number (c), since the number of regions is unknown beforehand. The best way to do that is to choose a range of values for c and then find the most appropriate one. To achieve that, the cluster number c is varied from 2 to k, k being a quantity smaller than the total number of sites, N. The lower bound of c is taken as 2, because the dataset is apparently clustered into more than one group. The interesting point here is to define the upper bound of c, i.e., the parameter k. Varying the k value affects the reliability of the identified cluster number; increasing k has been reported to generate a more consistent cluster number division (Mok et al. 2012). Many works can be found in the literature on identification of optimal cluster numbers. In most cases, $k \le \sqrt{N}$ is suggested as the upper bound (Xie and Beni 1991; Pal and Bezdek 1995; Mok et al. 2012). In a similar way, the optimum value of the fuzzifier (m) also needs to be found. Pal and Bezdek (1995) showed that the FCM algorithm works well when m varies from 1 to 2.5. It is hence suggested (Satyanarayana and Srinivas 2011) to obtain a number of sets of clustered regions by selecting a range of values for c and m, and then to identify the final clustered regions based on the optimal values of c and m by means of CVs. In this study, the FCM algorithm was executed for each case with the cluster number varying from $c_{min} = 2$ to $c_{max} = k = 10$ (with $\sqrt{83} \approx 9.1$ taken as 10), in increments of 1. The fuzzifier value (m) was increased from 1.1 to 3.0, in increments of 0.1. The change in the optimum value of the objective function with the fuzzifier m was plotted for cluster numbers from 2 to 10, for all three cases, and is shown in Fig. 3a, b, and c. It is observed that as the fuzzifier value increases, the optimum value of the objective function declines, for a given cluster number. In a similar way, the optimum value decreases with a rise in cluster number, for a given value of the fuzzifier. The FCM algorithm was found to perform better for m in the range of 1.5-2.5. Furthermore, the CVs described in Section 2.2 were calculated to obtain the optimal values of c and m, for each case.

Fig. 3 Variation in the optimum value of the objective function of the FCM algorithm with the fuzzifier m and the cluster number c, for (a) case 1: with total monthly rainfall as attributes; (b) case 2: with standard deviation of total monthly rainfall as attributes; and (c) case 3: with latitude, longitude, elevation, total annual rainfall, and standard deviation of total annual rainfall as attributes

Because the FCM algorithm performed better for m in the range of 1.5-2.5, the CVs for higher values of m are not considered and hence are not shown in the tables. In all three cases, it is observed that $V_{PC}$ decreases monotonically, whereas $V_{PE}$, $V_{FPI}$, and $V_{NCE}$ increase monotonically with an increase in the fuzzifier value, m, for a given cluster number c. Moreover, they show an overall monotonic decreasing ($V_{PC}$) and increasing ($V_{PE}$, $V_{FPI}$, and $V_{NCE}$) tendency with an increase in the cluster number c, so that they always point to the lowest values of c and m as the optimum set for homogeneous clustering. This indicates their ineffectiveness in determining the optimum number of homogeneous rainfall regions.
The extended Xie and Beni index ($V_{XB}$) and the Kwon index ($V_{K}$) were found to be relatively effective in these cases. They did not show any monotonic increasing or decreasing tendency with changes in c and m, and they were in line with each other. The Fukuyama and Sugeno index ($V_{FS}$) did not show any monotonic tendency either, but its values conflicted with those of $V_{XB}$ and $V_{K}$ in some cases. Taking into account similar studies on homogeneous rainfall region identification in the literature (e.g., Rao and Srinivas 2006b; Satyanarayana and Srinivas 2011; Farsadnia et al. 2014), $V_{XB}$ and $V_{K}$ were selected for clustering of rainfall regions. Henceforth, the clusters found from these two indices were considered for further analysis. Using these CVs, the optimal cluster number was identified as c = 3, with a corresponding fuzzifier value of m = 1.5, for all three cases.

Homogeneity test and adjustment of heterogeneous clusters
The H values of the homogeneity test for all three cases are given in Table 2. Initially, out of the nine clusters (three clusters for each case), four clusters were acceptably homogeneous, four clusters were possibly heterogeneous, and one cluster was definitely heterogeneous. Hence, adjustments were made to turn the possibly and definitely heterogeneous clusters into homogeneous ones, by either shifting discordant sites from one cluster to another or removing those sites if necessary. For case 1, stations Chubwa and Dilli were found to be discordant with all the clusters; hence, they were removed during adjustment, which led to three acceptably homogeneous clusters with the remaining 81 stations. Similarly, station Chubwa was found to be discordant for case 2; hence, it was removed from the clusters and adjustments were made for 82 stations. However, in case 3, no stations were removed. The two stations did not show any discordancy in case 3 because the geographic locations of the stations were considered during cluster analysis for that case. The H values of the homogeneity test after adjustment and the number of sites in each cluster are shown in Table 2. The final clusters formed after adjustment are shown in Fig. 4a, b, and c.

Comparison with similar studies done previously
Although this is the first study to divide the Brahmaputra valley region of India into hydrologically homogeneous regions using rainfall data recorded by tea gardens, the fuzzy clustering technique, and seven cluster validity indices (CVs), a few previous studies have applied various clustering techniques to the Indian subcontinent. Hence, comparisons have been made with those closely related studies to validate the performance of the present study. The homogeneous regions developed by the India Meteorological Department (IMD) comprise five large provinces which, although delineated based on rainfall characteristics, are influenced by contiguity of area and administrative state boundaries. Iyengar and Basak (1994) used principal component analysis (PCA) for regionalization of Indian monsoon rainfall and recommended the PCA approach for further subdivision of the region. Ten homogeneous sequential regions were formed in India from their analysis, in which the stations of the upper Brahmaputra valley region formed clusters similar to those found in the present study, although a few stations remained un-clustered. Singh and Singh (1996) carried out regionalization of monthly as well as seasonal rainfall for sub-Himalayan areas and the Gangetic plains by using PCA. They used rainfall data for a period of 114 years (1871-1984) from 90 well-distributed stations, which resulted in four distinct homogeneous rainfall areas for both monthly and seasonal scales. Srinivasa and Nagesh (2007) used fuzzy cluster analysis (FCA) to classify 159 meteorological stations in India and concluded that the FCA method performs better than the Kohonen artificial neural networks (KANN) method in finding meteorologically homogeneous groups.
They used location parameters (latitude, longitude, and elevation) along with other meteorological parameters for clustering, and the results exhibited 14 clusters over the Indian region, with the northeastern region in one cluster. Satyanarayana and Srinivas (2008) carried out regional frequency analysis using the LSAVs that affect the precipitation in a region, instead of observed precipitation data, and used K-means clustering with adjustments and L-statistics for regionalization. Seventeen homogeneous regions were formed after the analysis, with two regions covering the northeastern states. The upper Brahmaputra valley region came under a single homogeneous cluster, hence producing results similar to those of the present study. Satyanarayana and Srinivas (2011) carried out regionalization of rainfall data based on the fuzzy clustering method by using GCM data, location parameters, and seasonal precipitation data. The stations of the upper Brahmaputra valley region formed clusters similar to those found in the present study. Stations in middle Assam formed one cluster, while the other stations in upper Assam formed a different cluster. Saikranthi et al. (2013) used correlation analysis for regionalization based on seasonal and annual rainfall data. They used 51 years of daily rainfall data collected from more than 1000 rain gauges across India, which produced 26 homogeneous rainfall zones. However, because of data scarcity, the northeastern states were not included in the analysis. Bharath and Srinivas (2015) used wavelet-based global FCM analysis, instead of PCA, for determining homogeneous hydrometeorological regions in India. Their approach clustered the Indian territory into 29 regions, with the northeastern region having 7 clusters. The clusters formed in the upper Brahmaputra valley region were similar to those of the present study.
Kulkarni (2017) used probability density functions to divide the Indian subcontinent into homogeneous clusters, using daily summer monsoon rainfall at 357 square grids of size 10,000 km². The study produced five clusters, out of which one cluster covered adjoining regions and all other clusters were scattered, indicating irregular behavior of the daily rainfall pattern in India. The study used two time periods, 1901-1975 and 1976-2010, and the resulting clusters were found to be extremely different in the two periods. The clusters formed in the northeastern region also differ between the two periods and are not entirely in line with the present study. However, that study used gridded rainfall data instead of station data, hence the difference. Mannan et al. (2018) used climatic variables and self-organizing maps to regionalize India. An artificial neural network was used along with four CVs for clustering and applied to the gridded rainfall dataset (0.25° × 0.25°) from IMD for 34 years, as well as to climatic variables such as air temperature, surface pressure, geopotential height, and specific humidity. Ten homogeneous regions were formed when only rainfall data were used, whereas incorporation of climatic variables yielded 15 regions. Region 2 in their study covered northeast India, with rainfall of 7.2 mm/day.

Fig. 4 Clusters formed by the FCM algorithm after adjustment for (a) case 1: with total monthly rainfall as attributes; (b) case 2: with standard deviation of total monthly rainfall as attributes; and (c) case 3: with latitude, longitude, elevation, total annual rainfall, and standard deviation of total annual rainfall as attributes

Conclusion
In this paper, a fuzzy clustering approach has been used to classify regions with homogeneous rainfall in the upper Brahmaputra valley region of northeast India. Three different combinations of feature vectors were employed in the fuzzy c-means (FCM) algorithm to attain the best solutions to regionalization. Seven different CVs were used to determine the optimal partition, out of which the extended Xie and Beni index (V_XB) and the Kwon index (V_K) were selected for clustering of rainfall regions owing to their satisfactory performance. The optimal number of clusters for all three cases was identified as c = 3, with a corresponding fuzzifier value of m = 1.5. The clustered regions were then assessed for statistical homogeneity by performing homogeneity tests using the L-moment approach. Four clusters were found to be acceptably homogeneous.
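As an illustration of the clustering step summarized above, the following minimal Python sketch runs a plain FCM with c = 3 and m = 1.5 and evaluates an extended Xie and Beni index. It is not the implementation used in this study: the function names (fuzzy_c_means, xie_beni), the random initialization, the stopping tolerance, and the synthetic two-feature data are all assumptions made for illustration, and the index is written in its commonly cited form with memberships raised to the power m.

```python
import numpy as np


def squared_distances(X, centres):
    """Squared Euclidean distance of every point (row of X) to every centre."""
    return ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)


def fuzzy_c_means(X, c=3, m=1.5, max_iter=200, tol=1e-6, seed=0):
    """Plain FCM sketch: returns (centres, U) with U of shape (n, c)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is normalized to sum to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d2 = np.maximum(squared_distances(X, centres), 1e-12)
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # Recompute the centres so that they match the final memberships.
    W = U ** m
    centres = (W.T @ X) / W.sum(axis=0)[:, None]
    return centres, U


def xie_beni(X, centres, U, m=1.5):
    """Extended Xie-Beni index (assumed form): compactness over separation."""
    compactness = ((U ** m) * squared_distances(X, centres)).sum()
    separation = squared_distances(centres, centres)
    separation[np.diag_indices_from(separation)] = np.inf  # ignore i == j
    return compactness / (X.shape[0] * separation.min())


# Synthetic demonstration data (NOT the tea garden records): 3 groups of 20 points.
rng_demo = np.random.default_rng(1)
X_demo = np.vstack([
    rng_demo.normal(loc=mu, scale=0.5, size=(20, 2))
    for mu in ([0.0, 0.0], [5.0, 5.0], [10.0, 0.0])
])
centres_demo, U_demo = fuzzy_c_means(X_demo, c=3, m=1.5)
xb_demo = xie_beni(X_demo, centres_demo, U_demo, m=1.5)
```

In such a sketch, the index would be evaluated over a range of c and m values, and the partition giving the lowest value would be retained.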
The other clusters, which were possibly or definitely heterogeneous, were made homogeneous by adjusting the discordant sites. The results showed that the clustering pattern improved in case 3, where geographical location parameters (latitude, longitude, and elevation) were included along with the local rainfall data of tea gardens. This indicates that if regionalization needs to be done at a local scale, such as an average-sized watershed, further sub-clustering of a clustered region may be required. Local rainfall data, along with geographical location parameters, can be used for the purpose, since GCM data are of little use at this scale because of their coarse resolution. However, a good rainfall dataset with a large number of stations must be available within the region.
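The identification of discordant sites mentioned above can be sketched as follows. The sketch assumes the discordancy measure of Hosking and Wallis, computed from sample L-moment ratios (L-CV, L-skewness, L-kurtosis); this section does not spell out the exact procedure, so the choice of measure, the function names (l_moment_ratios, discordancy), and the synthetic gamma-distributed series are illustrative assumptions only.

```python
import numpy as np


def l_moment_ratios(x):
    """Return (L-CV, L-skewness, L-kurtosis) from unbiased sample PWMs."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    j = np.arange(1, n + 1)  # ranks of the ordered sample
    b0 = x.mean()
    b1 = np.sum((j - 1) / (n - 1) * x) / n
    b2 = np.sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((j - 1) * (j - 2) * (j - 3)
                / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l2 / l1, l3 / l2, l4 / l2


def discordancy(site_series):
    """Discordancy D_i of each site within one candidate region."""
    u = np.array([l_moment_ratios(s) for s in site_series])  # shape (N, 3)
    dev = u - u.mean(axis=0)
    A = dev.T @ dev  # 3 x 3 sum-of-squares and cross-products matrix
    N = u.shape[0]
    # D_i = (N / 3) * dev_i^T A^{-1} dev_i
    return (N / 3.0) * np.einsum("ij,jk,ik->i", dev, np.linalg.inv(A), dev)


# Synthetic annual totals (NOT observed data): 10 sites, 30 years each.
rng_demo = np.random.default_rng(0)
series_demo = [rng_demo.gamma(shape=4.0, scale=500.0, size=30) for _ in range(10)]
D_demo = discordancy(series_demo)
```

Sites whose D_i exceeds a critical value, which depends on the number of sites in the region, would be candidates for removal or reassignment to another cluster before the homogeneity test is repeated.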