Redefining homogeneous climate regions in Bangladesh using multivariate clustering approaches

The knowledge of the climate pattern of a particular region is important for taking appropriate actions to alleviate the impact of climate change. It is equally important for water resource planning and management purposes. In this study, the regional disparities and similarities among different climate stations in Bangladesh have been revealed by applying adaptive clustering algorithms, namely hierarchical clustering, partitioning around medoids (PAM), and k-means, under several validation measures to several important climatological factors, including rainfall, maximum temperature, and wind speed. The H_1 statistic based on the L-moment method was used to test the homogeneity of the clusters identified by the algorithms. The results suggest that the climate stations of Bangladesh can be grouped into two prime clusters. In most cases, one cluster is located in the northern part of the country and includes drought-prone and vulnerable regions, whereas the second cluster contains rain-prone and hilly regions found mostly in the southern part. The clusters identified by the three approaches are not identical in terms of cluster size and homogeneity; in contrast, the clusters identified by the hierarchical method for all three factors are either homogeneous or reasonably homogeneous. The implementation of principal component analysis on the climate station data further reveals that three latent factors play a vital role in addressing the total variation.


Introduction
Cluster analysis is frequently used as a statistical approach to find the spatial patterns or homogeneous climate zones of climatological factors. Climatological factors such as rainfall, temperature, cloud cover, and humidity play vital roles in climate policies and water resources management. One of the most crucial factors in this regard is rainfall, which is the primary source of water for global agriculture. Rainfall is extremely variable over time and place. Therefore, proper monitoring and knowledge of such climatological factors help in the planning of water resources management, identifying acute water scarcity, regionalizing drought, pre-planning coping strategies, flood control planning, drought monitoring planning, and drainage systems (Ullah et al. 2020; Khan et al. 2021). Identifying homogeneous climate regions is also important for detecting climate change and variability (Delitala et al. 2000), and for understanding eco-hydrological processes (Oguntunde et al. 2006; Machiwal et al. 2017).
Several clustering and other statistical approaches have been applied in the literature to identify the climatological disparities and similarities among different regions. For instance, hierarchical cluster analysis (HCA) (Jebari et al. 2007; Doulah and Islam 2019), the L-moment method (Rahman et al. 2013), multivariate regression (Sabziparvar et al. 2015), Bayesian kriging techniques (Gupta et al. 2017), k-means cluster analysis, spatial correlation (Jebari et al. 2007), Pearson's correlation (Haines and Olley 2017), partitioning around medoids (PAM) (Di Giuseppe et al. 2013), support vector machines and self-organizing linear output maps (Lin et al. 2017), regional frequency analysis (Kyselý et al. 2007; Medina-Cobo et al. 2017), principal component analysis (PCA) (Basalirwa 1995), fuzzy clustering (Goyal and Gupta 2014; Goyal et al. 2019), probability density functions (Kulkarni 2017), spectral analysis (Azad et al. 2010), and artificial neural networks (Matulla et al. 2003) are some of the approaches frequently used in the literature. Both HCA and PCA, fuzzy c-means (Goyal et al. 2019), and k-means cluster analysis (Srinivasa Raju and Nagesh Kumar 2016) were also used to find homogeneous climate zones in India. The HCA was also applied to temperature and rainfall to identify the homogeneous climate zones in the United States (Fovell and Fovell 1993), and to daily temperature, relative humidity, solar radiation, and wind speed in China (Xiong et al. 2019). Homogeneous climate regions were identified by exploring daily temperature and relative humidity data using multivariate clustering approaches in Iran (Modarres 2006; Roshan et al. 2021) and Malaysia (Yunus et al. 2011; Ahmad et al. 2013; Mathulamuthu et al. 2016). Multivariate clustering approaches were also applied to divide Bolivia (Abadi et al. 2020) and Pakistan (Hussain and Lee 2009; Ali et al. 2019) into identical and coherent climatological sub-regions.
Clustering methods do not all perform identically for every problem or data set (Rodriguez et al. 2019), so it is important to find a suitable method for precise findings and predictions (Mahmud and Islam 2019) and for fostering reproducibility (Zhang et al. 2016). A recent study in Turkey (Unal et al. 2003) showed that HCA is the most suitable algorithm among five multivariate clustering approaches for temperature data. Matulla et al. (2003) also observed large differences in clustering accuracy among several multivariate clustering approaches for different seasonal and periodical rainfall data. It has also been observed that the k-means clustering approach outperforms other methods for data with outliers (Chen et al. 2005), and that fuzzy c-means provides outstanding performance for simulated overlapping data with outliers, while hierarchical and k-means clustering perform equally poorly (Mingoti and Lima 2006).
In Bangladesh, a small number of studies have been conducted (Doulah and Islam 2019; Alam and Paul 2020) applying HCA, k-means clustering, and fuzzy c-means to rainfall data to identify homogeneous climate regions. HCA and PCA were also used to find homogeneous regions by analyzing annual and seasonal rainfall data (Rahman et al. 2018). However, in this study, for the first time, suitable scientific methods, i.e., hierarchical, k-means, and PAM clustering, have been applied to several climatological factors to identify the homogeneous climate zones of Bangladesh. In addition, the study uniquely used seven cluster validation indexes to identify the most suitable clustering approach and the optimal number of clusters. Since Bangladesh is an agricultural country and depends heavily on nature, the findings of this study could be helpful to policymakers in Bangladesh.

Study area and data
The study area is exclusively Bangladesh, which is situated in the South Asian region between two contrasting environments, the Bay of Bengal to the south and the Himalayas to the north. It extends from latitudes 20°34ʹ to 26°38ʹ north and longitudes 88°01ʹ to 92°41ʹ east and covers an area of more than 140,000 sq km (Ahasan et al. 2011). Bangladesh has a subtropical monsoon climate characterized by wide climatological diversity in high temperature and humidity. To observe this climatological diversity, the monthly averages of different climatological factors, namely rainfall (mm), relative humidity (%), maximum temperature (°C), minimum temperature (°C), wind speed (m/s), cloud cover (octa), sunshine hour (hour), and sea level pressure (millibars), from 1949 to 2017 were taken into consideration. The data were recorded at 34 rain gauge stations (latitude, longitude), from Barisal (22.70, 90.36) to Teknaf (20.87, 92.26), by the Bangladesh Meteorological Department (BMD) (https://www.bmd.gov.bd/). The locations of the rain gauge stations are presented in Fig. 1, and the distribution of the monthly averages of the climatological factors at each station is presented in Fig. 2. Large variation in monthly average rainfall was observed in Cox's Bazar, Kutubdia, Sandwip, and Teknaf. Relative humidity was more variable in Bogra, Dinajpur, Rangpur, and Teknaf than in other stations. Chuadanga, Khepupara, and Khulna experienced more variation in sea level pressure than other stations. Maximum temperature was more variable in Chuadanga, Rajshahi, and Srimangal than in other stations.

Data preparation
It is accepted that data with different units of measurement might influence the clustering results. To avoid this influence, the data should be normalized with appropriate transformation functions. Before performing the clustering analysis, we have used the following transformation function to normalize the data (Dikbas et al. 2013):

NV_i = (V_i − V_min) / (V_max − V_min)    (1)

where NV_i is the normalized annual maximum value at station i, V_i is the annual maximum value at station i, and V_max and V_min are the respective maximum and minimum values of the station. Furthermore, missing values are a very common phenomenon in climatological data, and the analysis of climatological factors is often highly affected by them. In this study, we have estimated the missing values by using the Multiple Imputation by Chained Equations (MICE) method (White et al. 2011; Mahmud et al. 2020), which is also known as sequential regression multivariate imputation.
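As an illustration, the min-max transformation of Eq. 1 can be sketched in a few lines of Python; the rainfall values below are hypothetical, not BMD records:

```python
import numpy as np

def min_max_normalize(values):
    """Eq. 1: NV_i = (V_i - V_min) / (V_max - V_min), rescaling to [0, 1]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

rainfall = [120.0, 310.0, 45.0, 260.0]   # hypothetical monthly values (mm)
print(min_max_normalize(rainfall))       # smallest value maps to 0, largest to 1
```

After this rescaling, factors recorded in mm, °C, and m/s all contribute on a comparable scale to the distance measures used by the clustering algorithms.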

Methodology
Clustering is a popular multivariate approach for identifying dimensionality and associated variables or features (Johnson and Wichern 2002). In climatology, clustering is a core technique for disclosing hidden patterns in climatological or hydrological features (Ouyang et al. 2010). Most of the techniques create groups among the objects of the data using similarity or dissimilarity (distance) measures. In this study, the clustering approaches hierarchical, k-means, and PAM, together with PCA, were applied to identify homogeneous climate regions.

Hierarchical clustering
The hierarchical clustering method is widely used for climatological or hydrological pattern discovery (Ouyang et al. 2010). It works by building a tree-like hierarchy using divisive or agglomerative approaches (Giraldo et al. 2012). Hierarchical clustering methods are usually grouped into two major techniques: agglomerative, a series of successive merges, and divisive, a series of successive divisions. Because of their computational and technical feasibility, agglomerative methods are more widely preferred in applications than divisive ones. In the agglomerative approach, each object in the data set is initially considered a cluster. Subgroups are then created by combining the most similar, or closest, objects, and these clusters are merged in turn according to their similarity or closeness (Johnson and Wichern 2002). The agglomerative hierarchical clustering approach can be summarized in the following steps (Iyigun et al. 2013): Step 1: Start with the initial n objects or clusters, and calculate the proximity matrix for the clusters.
Step 2: In the proximity matrix, search for the minimal distance dis(cl_i, cl_j) = min_{1 ≤ s, r ≤ n, s ≠ r} dis(cl_s, cl_r), where dis(·,·) denotes the distance measure, and combine clusters cl_i and cl_j to construct a new cluster cl_ij.
Step 3: Measure the distances between the clusters cl ij and the other clusters and update the proximity matrix.
Step 4: Repeat Steps 2 to 3 until only one cluster remains.
However, several proximity measures are available for the agglomerative hierarchical clustering approach, namely single linkage, complete linkage, average linkage, and Ward's method, which are called linkage methods. In this study, we have used Ward's method (Ward Jr 1963), which merges two of the n distinct clusters by minimizing the increase in the error sum of squares (ESS), or loss of information, until all of the objects are in a single cluster after n − 1 steps. The error sum of squares is defined as:

ESS_i = Σ_j Σ_k (x_ijk − x̄_i·k)²,  ESS = Σ_i ESS_i    (2)

where x_ijk is the jth observation of the kth variable in the ith cluster and x̄_i·k is the ith cluster mean (centroid) of the kth variable. ESS_i is the sum of squared deviations of every object in the ith cluster from the cluster centroid; if there are currently p clusters, ESS is the sum of ESS_1, …, ESS_p. At every step, the union of every possible pair of clusters is considered, and the pair of clusters whose merger contributes the smallest increase in the error sum of squares, or loss of information, is merged.
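Ward's merge rule can be made concrete with a small sketch: the ESS increase from merging clusters A and B reduces to n_A·n_B/(n_A + n_B) · ||centroid_A − centroid_B||², so a naive agglomerative loop needs only cluster sizes and centroids. The data here are synthetic stand-ins for normalized station profiles, not the BMD records:

```python
import numpy as np

def ward_agglomerative(X, k):
    """Naive agglomerative clustering with Ward's criterion: repeatedly merge
    the pair of clusters whose union gives the smallest ESS increase,
    delta(A, B) = n_A * n_B / (n_A + n_B) * ||centroid_A - centroid_B||^2."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = X[clusters[a]].mean(axis=0)
                cb = X[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                delta = na * nb / (na + nb) * float(np.sum((ca - cb) ** 2))
                if best is None or delta < best[0]:
                    best = (delta, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge the cheapest pair
        del clusters[b]
    return clusters

rng = np.random.default_rng(1)
# two well-separated synthetic "station" groups in 12 monthly dimensions
X = np.vstack([rng.normal(0.2, 0.05, (5, 12)), rng.normal(0.8, 0.05, (5, 12))])
print(ward_agglomerative(X, 2))  # recovers the two groups: indices 0-4 and 5-9
```

This quadratic-time loop is only for exposition; production analyses would use an optimized implementation such as `scipy.cluster.hierarchy.linkage(X, method="ward")`.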

k-means clustering
The k-means method is a popular nonhierarchical clustering technique (MacQueen 1967). It creates a specified number of partitions or groups among items rather than variables. In this technique, each object is assigned to the cluster whose centroid (mean) is closest to that object compared with the other cluster centroids, and these centroids represent the clusters. This method, also known as a partitioning approach (Johnson and Wichern 2002), is based on the minimization of an objective function that separates the objects into groups. The approach classifies the P objects of a data set Y into k groups (Burn and Goel 2000). First, the number of clusters k is specified, and each object is assigned to the nearest cluster center with the help of a similarity measure. After the assignment of each object to a cluster, the cluster centers of all k clusters are recalculated, and the objects are shifted between clusters according to the positions of the newly calculated cluster centers. This procedure is repeated until there is no change in the cluster centers. The Euclidean distance measure is used to determine the similarity of the objects to the cluster centers. Let Y be a data set of P objective vectors in t dimensions; the kth objective vector may be written as y_k = [y_k1, y_k2, …, y_kt]^T, y_k ∈ R^t. The partition of the P objective vectors is described by a discrimination (membership) matrix M of dimension P × c and a set of cluster centers Q = {q_1, q_2, …, q_c}, where y_k^(i) is the kth objective vector of the ith cluster and c indicates the number of clusters.
The distance between the kth objective vector of the ith cluster and the cluster center is d(y_k^(i), q_i), and m_ik ∈ {0, 1} is the membership degree of the kth objective vector in the ith cluster: the kth objective vector is assigned to the ith cluster if m_ik = 1 and not assigned if m_ik = 0.
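The assign-and-update loop described above can be sketched directly in NumPy (Lloyd's algorithm with a simple farthest-point initialization; the data are synthetic):

```python
import numpy as np

def k_means(X, k, n_iter=100):
    """Lloyd's k-means: assign each object to its nearest centroid,
    recompute centroids, and stop when the centroids no longer move."""
    # farthest-point initialization: start from object 0, then repeatedly
    # add the object farthest from the centroids chosen so far
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(
            X[:, None, :] - np.array(centroids)[None, :, :], axis=2
        ).min(axis=1)
        centroids.append(X[int(np.argmax(d))])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Euclidean distance of every object to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.2, 0.05, (6, 3)), rng.normal(0.8, 0.05, (6, 3))])
labels, centres = k_means(X, 2)
```

The farthest-point start is one of several reasonable initializations; random restarts or k-means++ are common alternatives when the data are less cleanly separated.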

Partitioning around medoids (PAM)
PAM (Kaufman and Rousseeuw 1987) is an unsupervised clustering algorithm that creates partitions or groups among objects by minimizing the total within-cluster dissimilarity, similar in spirit to k-means clustering. PAM, however, is considered more robust since it represents each cluster by an actual object (medoid) and can work with dissimilarities other than the Euclidean distance. The PAM approach initially selects k objects at random as medoids, and the remaining objects are assigned to the nearest medoid; this process is repeated until no change takes place in the medoids (Bhat 2014). Suppose k objects C = {c_1, c_2, c_3, …, c_k} are randomly selected as medoids from the n data points D = {d_1, d_2, d_3, …, d_n}, so that c_i is the cluster represented by medoid d_i. For a current medoid d_i, we need to decide whether it should be swapped with a non-medoid object d_h; the swap should take place if it reduces the cost TD (the total distance of all objects to their cluster medoids). Here, TD_ih is the total cost of swapping the cluster medoid, as presented in Eq. 4 (Preud'homme et al. 2021).
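The medoid-swap step can be sketched on a precomputed dissimilarity matrix; the points below are synthetic, whereas a real analysis would build the matrix from the normalized station data:

```python
import numpy as np

def pam(D, k, seed=0):
    """Minimal PAM sketch on a precomputed n x n dissimilarity matrix D.

    Starting from k random medoids, test every (medoid, non-medoid) swap and
    accept a swap whenever it lowers TD, the total dissimilarity of all
    objects to their nearest medoid; stop when no swap improves TD."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))

    def td(meds):
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(len(D)):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = h
                if td(candidate) < td(medoids):
                    medoids = candidate
                    improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoids, labels = pam(D, k=2)
```

Because the cluster prototypes are actual objects and TD is a plain sum of dissimilarities, a single outlying station shifts the medoids far less than it would shift a k-means centroid.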

Cluster validation
Cluster validation is an inevitable part of cluster analysis for evaluating the goodness of clustering methods (Dubes and Jain 1979). In this study, to identify the best clustering approach and the optimal number of clusters, we have used internal and stability validation measures through the R package "clValid" (Brock et al. 2008).

Internal validation
The internal validation considered three commonly used indexes: connectivity, Silhouette Width, and the Dunn index. Connectivity (Handl et al. 2005) reflects the extent to which the nearest neighbors of each observation are placed in the same cluster. Let nn_i(j) denote the jth nearest neighbor of observation i, and let x_{i,nn_i(j)} = 0 if i and nn_i(j) are in the same cluster and 1/j otherwise. Then the connectivity is Conn(C) = Σ_{i=1}^{N} Σ_{j=1}^{L} x_{i,nn_i(j)}, where L is the number of neighbors considered; smaller values indicate better clustering. The Silhouette Width measures how well objects are clustered by contrasting within-cluster and between-cluster distances. Suppose C = {c_1, c_2, …, c_k} is a set of clusters and observation j belongs to cluster C ∈ C. Let a_j be the average dissimilarity between j and the other objects of its own cluster, and let b_j = min_{C′ ≠ C} d(j, C′) be the smallest average dissimilarity between j and the observations of any other cluster. Then the silhouette width of observation j is S_j = (b_j − a_j) / max(a_j, b_j). Objects with large S_j (close to 1) are well clustered, whereas small S_j (around 0) indicates objects lying between two clusters.
The Dunn index is another internal validation measure. It computes the minimum inter-cluster separation min.s (the smallest distance between observations in different clusters) and the maximum intra-cluster distance max.d (the largest distance between observations within the same cluster), and is calculated as D = min.s / max.d. A large value of the Dunn index indicates compact, well-separated clusters.
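The Dunn index as defined above can be computed directly; the 2-D points here are synthetic:

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index D = min.s / max.d: smallest between-cluster distance
    divided by largest within-cluster distance; larger is better."""
    groups = [X[labels == c] for c in np.unique(labels)]
    min_s = min(
        np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).min()
        for i, a in enumerate(groups) for j, b in enumerate(groups) if i < j
    )
    max_d = max(
        np.linalg.norm(g[:, None, :] - g[None, :, :], axis=2).max() for g in groups
    )
    return min_s / max_d

X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [4.0, 4.0], [4.2, 4.0], [4.0, 4.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(dunn_index(X, labels))  # well-separated clusters give a value above 1
```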

Stability validation
The stability validation includes four different measures: the average proportion of nonoverlap (APN), the average distance (AD), the average distance between means (ADM), and the figure of merit (FOM).
The APN computes the average proportion of observations not placed in the same cluster under clustering based on the full data and clustering based on the data with a single column removed. Let C^{i,0} denote the cluster containing observation i in the clustering based on the full data, C^{i,l} the cluster containing observation i when the clustering is based on the data set with column l removed, and n(·) the number of observations in a cluster; with N observations and P variables, the APN can be defined as:

APN(C) = (1 / (N P)) Σ_{i=1}^{N} Σ_{l=1}^{P} (1 − n(C^{i,l} ∩ C^{i,0}) / n(C^{i,0}))    (5)

The values of APN lie between 0 and 1, and a value close to zero indicates consistent clustering results. The AD measures the average distance between observations placed in the same cluster under clustering based on the full data and clustering based on the data with a single column removed. The AD can be calculated by Eq. 6:

AD(C) = (1 / (N P)) Σ_{i=1}^{N} Σ_{l=1}^{P} [ (1 / (n(C^{i,0}) n(C^{i,l}))) Σ_{x ∈ C^{i,0}, y ∈ C^{i,l}} dist(x, y) ]    (6)

The value of AD lies between 0 and ∞, and a smaller value of AD indicates good clustering.
The ADM measures the distance between the cluster centers for observations placed in the same cluster under clustering based on the full data and clustering based on the data with a single column removed. It is defined as:

ADM(C) = (1 / (N P)) Σ_{i=1}^{N} Σ_{l=1}^{P} dist(x̄_{C^{i,l}}, x̄_{C^{i,0}})    (7)

where x̄_{C^{i,0}} is the mean of the observations in the cluster containing observation i based on the full data, and x̄_{C^{i,l}} is the corresponding mean when column l is removed. Similarly, the value of ADM lies between 0 and ∞, and a smaller value of ADM indicates good clustering.
The FOM is a further stability measure that estimates the mean intra-cluster variance of the observations in the removed column, where the clustering is based on the remaining (unremoved) columns of the data. For column l, the FOM can be determined by Eq. 8:

FOM(l, C) = sqrt( (1/N) Σ_{k=1}^{K} Σ_{i ∈ C_k(l)} (x_{i,l} − x̄_{C_k(l)})² )    (8)

where x_{i,l} is the value of the ith observation in the lth column and x̄_{C_k(l)} is the mean of column l over cluster C_k. The values of FOM also lie between 0 and ∞, and a smaller value indicates good clustering.
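The leave-one-column-out logic shared by the stability measures can be illustrated with the APN, assuming any clustering function that maps a data matrix to labels; the threshold rule and data below are deliberately trivial and synthetic:

```python
import numpy as np

def apn(X, cluster_fn):
    """Average proportion of non-overlap: for each observation and each
    removed column, the proportion of its full-data cluster that it no
    longer shares a cluster with after the column is removed."""
    n, p = X.shape
    full = cluster_fn(X)
    score = 0.0
    for l in range(p):
        reduced = cluster_fn(np.delete(X, l, axis=1))
        for i in range(n):
            c_full = set(np.where(full == full[i])[0])
            c_red = set(np.where(reduced == reduced[i])[0])
            score += 1.0 - len(c_full & c_red) / len(c_full)
    return score / (n * p)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.2, 0.05, (5, 4)), rng.normal(0.8, 0.05, (5, 4))])
cluster_fn = lambda M: (M.mean(axis=1) > 0.5).astype(int)
print(apn(X, cluster_fn))  # clustering unchanged by column removal -> 0.0
```

AD and ADM follow the same skeleton, replacing the set-overlap term with average pairwise distances and centroid distances, respectively.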

Regional homogeneity test
A homogeneity test (the H statistic) based on the L-moment method (Hosking and Wallis 1997) has been used for testing the homogeneity of the clusters identified by the cluster analysis. Based on the L-moment ratios, the L-coefficient of variation (L-Cv), the L-coefficient of skewness (L-Cs), and the L-coefficient of kurtosis (L-Ck) give three versions of the H statistic, denoted H_1, H_2, and H_3 (Alam and Paul 2020). In this study, the H_1 version has been applied since it indicates heterogeneity or potential heterogeneity more reliably than H_2 and H_3 (Kyselý et al. 2007; Alam and Paul 2020). The homogeneity is evaluated by fitting the four-parameter Kappa distribution to the regional data sets (Hosking and Wallis 1997). The weighted sample dispersion of the L-Cv values, V_1, can be calculated using Eq. 9:

V_1 = { Σ_{j=1}^{p} n_j (L_Cv_j − L_Cv̄)² / Σ_{j=1}^{p} n_j }^{1/2}    (9)

where n_j is the number of data points in objective vector j among the p vectors, L_Cv_j is the value of L_Cv for the jth vector, and L_Cv̄ is the weighted average of the L_Cv_j values. Then, H_1 can be computed using Eq. 10:

H_1 = (V_1 − μ_V1) / σ_V1    (10)

where μ_V1 and σ_V1 are the average and standard deviation of the values of V_1 obtained from the simulation. Following Hosking and Wallis (1997), a region is regarded as acceptably homogeneous if H_1 < 1, possibly (probably) heterogeneous if 1 ≤ H_1 < 2, and definitely heterogeneous if H_1 ≥ 2.
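The sample L-Cv entering Eq. 9 can be computed from probability-weighted moments; the sketch below uses the standard estimators for b_0 and b_1 and covers only the dispersion V_1, not the Kappa-distribution simulation needed to calibrate H_1 itself:

```python
import numpy as np

def l_cv(x):
    """Sample L-Cv t = l2 / l1 via probability-weighted moments b0, b1."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    b0 = x.mean()
    b1 = np.sum(np.arange(n) / (n - 1) * x) / n   # weights (j-1)/(n-1), ascending
    l1, l2 = b0, 2.0 * b1 - b0
    return l2 / l1

def v1(series_list):
    """Record-length-weighted dispersion of the site L-Cv values (Eq. 9)."""
    n_j = np.array([len(s) for s in series_list], dtype=float)
    t_j = np.array([l_cv(s) for s in series_list])
    t_bar = np.sum(n_j * t_j) / np.sum(n_j)
    return float(np.sqrt(np.sum(n_j * (t_j - t_bar) ** 2) / np.sum(n_j)))

print(l_cv([1.0, 2.0, 3.0]))  # equals 1/3 for this sample
```

Identical sites give V_1 = 0; H_1 then compares the observed V_1 with its simulated distribution under a homogeneous Kappa world.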

Principal component analysis (PCA)
The PCA algorithm finds a small set of variables that captures the maximum variation of the data set by reducing its dimensionality. PCA is also used to classify objects into one or more clusters based on similarity. It identifies patterns by explaining the variation of a large number of associated variables and replacing them with a smaller set of independent factors (Dillon and Goldstein 1984; Machiwal et al. 2019). Let the variable V(i, j) at the ith location and jth time be represented as a sum of products of the coefficients Cf, varying over location, and the eigenvectors Eg, varying over time. Then V(i, j) can be expressed by Eq. 11:

V(i, j) = Σ_{k=1}^{m} Cf_k(i) Eg_k(j)    (11)
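In practice this decomposition is obtained via the singular value decomposition of the column-centred data matrix; a sketch on synthetic data with one dominant direction:

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the column-centred data: returns component scores and
    the fraction of total variance explained by each retained component."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

t = np.linspace(0.0, 1.0, 12)
# three correlated synthetic "stations" driven by one latent signal
X = np.column_stack([t, 2.0 * t + 0.01 * np.sin(9.0 * t), 0.5 * t])
scores, explained = pca(X)
print(explained)  # the first component captures almost all the variance
```

Plotting the first two score columns against each other is exactly the PC1-versus-PC2 view used to read off latent station groupings.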

Findings and discussions
Initially, we created a heatmap for all 34 rain gauge stations based on the considered climatological factors. A heatmap is a way of visualizing hierarchical clustering to identify groups of factors that are dependent on each other and have a joint effect on climate (Van Loon and Laaha 2015). The heatmap shows that the stations can be represented by three groups of climatological factors, namely group 1: maximum temperature (maxT) and sea level pressure (slevel); group 2: relative humidity (rhum), minimum temperature (mint), and rainfall (rain); and group 3: wind speed (windsp), sunshine hour (sunshine), and cloud cover (cloudc) (see Fig. 3). We then selected three factors, namely maximum temperature, rainfall, and wind speed, as the most representative factors of their respective groups for identifying the homogeneous climate regions. The cluster analysis was performed by choosing the optimal number of clusters. To find the optimal number of clusters and a suitable clustering approach, we calculated seven cluster validity indexes. All three clustering approaches, namely hierarchical, k-means, and PAM, were performed for comparison. After performing the clustering analysis, we tested the homogeneity of the identified clusters using the H_1 statistic based on the L-moment ratios. We also used PCA for clustering purposes. All analyses were performed using the statistical software R (version 4.1.2).

Optimal number of cluster and clustering approach
To identify the optimal number of clusters and suitable clustering approaches for the three selected factors (rainfall, maximum temperature, and wind speed), seven validity indexes, namely connectivity, Dunn, Silhouette, APN, AD, ADM, and FOM, were first calculated. All these measures were calculated for numbers of clusters from two to 10 and are presented in Fig. 4. Most of the measures indicate that the optimal number of clusters is two for all three factors. More specifically, the lowest connectivity (2.93), APN (0.02), and ADM (0.11), and the highest Dunn index (0.53) and Silhouette Width (0.27), were observed for maximum temperature at two clusters (Table 1). Similarly, the lowest connectivity (2.93), APN (0.001), ADM (0.00), AD (1.59), and FOM (0.53) for rainfall were observed at two clusters. Except for AD and FOM, all indexes show that the optimal score for wind speed is also observed at two clusters. Table 1 also shows that most of the indexes suggest that hierarchical clustering might be a suitable approach for rainfall and wind speed. For maximum temperature, however, three measures (Silhouette, APN, ADM) suggest the k-means clustering approach and three measures (connectivity, Dunn, AD) suggest hierarchical clustering. Figure 5 shows the clusters formed by agglomerative hierarchical clustering for all three factors, namely maximum temperature, rainfall, and wind speed. Here, different colors indicate different clusters (blue represents Cluster-1 and green represents Cluster-2). The geographical distribution of the clusters identified by the hierarchical method is presented in Fig. 6. The hierarchical clustering approach assigned the 34 stations into two clusters for maximum temperature, rainfall, and wind speed: Cluster-1 with 11, 21, and 23 stations, respectively, and Cluster-2 with 23, 13, and 11 stations, respectively.
Figure 6 indicates that maximum temperature has low magnitude and variability in Cluster-1, which is located in the Southeast and Eastern hilly regions of Bangladesh, including Faridpur but excluding Chittagong and Rangamati. Cluster-1 for rainfall mostly contains plain and arid climate regions and covers the Northwest and North Central parts of Bangladesh, while Cluster-2 for rainfall and wind speed experiences large variability and contains mostly the hilly and rain-prone regions of Bangladesh. The k-means clustering approach assigned the 34 stations into two clusters for maximum temperature, rainfall, and wind speed: Cluster-1 with 10, 23, and 30 stations, respectively, and Cluster-2 with 24, 11, and 4 stations, respectively (see Fig. 7). The spatial distribution of the clusters identified by k-means clustering and the distribution of maximum temperature, rainfall, and wind speed in each of the clusters are presented in Fig. 8. Cluster-1 for maximum temperature and Cluster-2 for rainfall mostly cover the Southeast and Eastern hilly parts of Bangladesh and have larger variability. The PAM approach recognized 11 stations based on maximum temperature as Cluster-2, which has higher variation than Cluster-1 (see Figs. 9 and 10). Cluster-1 for rainfall contains 23 stations and exhibits lower variability than Cluster-2; it covers the Southeast, Northeast, and Eastern hilly regions of Bangladesh (Fig. 10). The PAM method for wind speed assigned 9 stations to Cluster-2, which has large variability and is located in the southern part of Bangladesh (see Figs. 9 and 10).

Principal components analysis (PCA)
The principal components analysis was applied to reduce the dimensionality and to generate an effective description of the climatological factors, in which differences between different groups could be seen vividly. A leading portion of the variation was explained by the first two principal components (PCs) for each of the climatological factors: 69-77% for maximum temperature, 82-89% for rainfall, and 86-97% for wind speed. According to the plot of the first PC against the second PC (Fig. 11), all 34 stations can be classified into 3 latent clusters for each of the climatological factors, namely maximum temperature, rainfall, and wind speed.

Homogeneity of the clusters
The newly defined clusters for maximum temperature, rainfall, and wind speed obtained by the different clustering techniques were tested for homogeneity using the H_1 statistic based on the L-moments (Hosking and Wallis 1997). The test results are presented in Table 2. The results show that the values of H_1 are less than the critical value (H_1 < 1) for both clusters identified by the hierarchical method based on rainfall; hence, the clusters from the hierarchical approach based on rainfall are reasonably homogeneous. Cluster-2 for maximum temperature and wind speed is recognized as probably homogeneous since 1 < H_1 < 2, and Cluster-1 for maximum temperature and wind speed is acceptably homogeneous (H_1 < 1). K-means clustering produced a definitely heterogeneous cluster (Cluster-2) for both maximum temperature and wind speed (H_1 > 2). All clusters based on maximum temperature and rainfall using the PAM method are acceptably homogeneous. However, the PAM algorithm formed one definitely heterogeneous cluster and one probably homogeneous cluster for wind speed.

Comparison with previous studies
This is the first study of its kind in which an adaptive multivariate technique has been used to identify homogeneous climate zones for Bangladesh by applying different clustering methods (hierarchical, k-means, and PAM) under seven cluster validity measures to several climatological factors. A few studies have applied other traditional methods to data from other countries, so it is worth comparing the findings of the present work with those of closely related studies. Goyal and Gupta (2014) identified homogeneous rainfall regimes in the northeast region of India using fuzzy clustering and compared them with the k-means clustering method; they found that fuzzy clustering identified considerably more homogeneous regions than the k-means approach. Alam and Paul (2020) applied hierarchical, k-means, and fuzzy clustering to identify homogeneous rainfall regions in Bangladesh and found that fuzzy clustering outperformed the other methods; however, they considered neither cluster validation nor factors other than rainfall. Dikbas et al. (2012) used the fuzzy clustering approach to classify rainfall series and identify hydrologically homogeneous zones in Turkey, comparing the results of the FCM and k-means methods; the FCM outperformed the other methods. For analyzing streamflow processes, Dikbas et al. (2013) found that the k-means algorithm can adequately identify the hydrologically homogeneous regions of Turkey.

Conclusion
Determining the regional disparities among climate regions is an important tool for understanding the fundamental features of the geographical environment. In this study, important meteorological variables, namely maximum temperature, rainfall amount, and wind speed, have been taken into consideration to address the regional disparities in Bangladesh by using appropriate multivariate approaches. Most strikingly, the findings of this study provide an equal number of clusters for all climatological factors. The clusters identified by the hierarchical, k-means, and PAM clustering approaches are not identical in terms of cluster size and within-cluster homogeneity. All clusters identified by the hierarchical method are either reasonably homogeneous or probably homogeneous, whereas k-means identified two definitely heterogeneous clusters and PAM one. Based on the results of the statistical validity tests and the homogeneity test, it is reasonable to claim that the hierarchical clustering approach might be one of the suitable algorithms for identifying homogeneous regions in Bangladesh. These groupings might play a vital role at the policymaking and implementation stages for stakeholders, including the government of Bangladesh.