In-depth research [14–17] has been done on the problem of finding the number of clusters k in a dataset [3, 20], and other authors have provided a range of methods. To evaluate the number of clusters produced by clustering algorithms [24, 25], a number of validity indices have been developed in the literature; examples include the Gap index, the Silhouette index, and the Elbow method. In a significant portion of the research [12, 13], several validity indices have been investigated by applying clustering algorithms to various datasets for a range of values of k and selecting the k that yields the best partition.
Elbow Method
Clustering, a classic machine learning approach, is crucial in data analysis. Most clustering algorithms rely on a predetermined number of clusters, although in practice the number of clusters is frequently unknown. In cluster analysis, the elbow method is a heuristic for estimating the number of clusters in a dataset. The Sum of Squared Error (SSE) is widely used as a reference for choosing the optimal number of clusters. The SSE is computed using Eq. (3.1), where d_i is the distance between the ith data point and its cluster centre.
$$SSE=\sum_{i=1}^{n}{d}_{i}^{2} \tag{3.1}$$
The strategy consists of plotting the sum of squared errors as a function of the number of clusters and choosing the elbow of the curve as the number of clusters to use. In the Elbow approach, we vary the number of clusters K, typically from 1 to 10, and compute the Within-Cluster Sum of Squares (WCSS) for each value of K. The WCSS is the sum of the squared distances between each point in a cluster and the cluster centroid. According to the Elbow technique, the total WCSS is plotted as a function of the number of clusters, and the number of clusters should be selected at the point beyond which adding more clusters does not significantly improve the total WCSS. We first determine the WCSS for k = 2 (two clusters), then incrementally raise k by one unit at each iteration (k = k + 1), recomputing the WCSS each time. If the WCSS does not decrease appreciably beyond a certain value of k, then that k is the best choice for the number of clusters.
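As a minimal illustration of Eq. (3.1), the following Python sketch computes the SSE for a toy set of points with hypothetical cluster assignments; the arrays points, centres, and labels are illustrative placeholders, not data from this work.

```python
import numpy as np

# Hypothetical toy data: six 2-D points, two cluster centres, and the
# index of the centre each point is assigned to.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
centres = np.array([[1.0, 1.0], [5.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Eq. (3.1): sum over all points of the squared distance d_i between
# each point and the centre of the cluster it belongs to.
d = np.linalg.norm(points - centres[labels], axis=1)
sse = np.sum(d ** 2)
print(f"SSE = {sse:.4f}")
```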
The plot of the WCSS against the K value resembles an elbow. The WCSS value decreases as more clusters are added and is at its maximum when K = 1. Examining the chart in Fig. 1 closely, we can see that the curve bends sharply at one point, taking the shape of an elbow; from this point onwards, the curve flattens out in the direction of the X-axis. The K value corresponding to this point is the appropriate number of clusters.
The Elbow method estimates the correct number of clusters as follows (a code sketch of the procedure follows the list):
- Apply the K-means algorithm for various values of k. The value of k can range from one up to the total number of instances.
- Calculate the total WCSS for each k.
- Plot the WCSS curve against the number of clusters k.
- The location of a bend in the plot is generally taken as an indicator of the appropriate number of clusters.
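A sketch of the full elbow procedure is given below, assuming the scikit-learn and matplotlib libraries are available; KMeans exposes the WCSS of a fitted model as its inertia_ attribute, and the synthetic dataset from make_blobs is an illustrative stand-in for real data. The bend must still be read off the resulting plot by inspection.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic dataset with four well-separated clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Run K-means for k = 1..10 and record the WCSS (inertia_ in scikit-learn).
k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in k_values]

# Plot WCSS against k; the elbow of the curve suggests the cluster count.
plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```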
The elbow method's drawback is that it evaluates only one global clustering characteristic. The elbow approach is a somewhat subjective strategy that does not work for every dataset; it merely indicates where the optimal value of k might lie. Moreover, the elbow approach is not always effective, especially when the data are not strongly clustered. Even though the elbow technique is one of the most widely used methods for establishing the optimal cluster number, determining the number of clusters depends on manually locating the elbow point on the visualised curve, and even experienced analysts find it difficult to identify the elbow point quickly when the curve is relatively smooth [22].
Silhouette Index Method
The silhouette value measures an object's cohesion with its own cluster compared with other clusters (separation). The silhouette plot, as shown in Fig. 2, illustrates how close the points in one cluster are to points in the neighbouring clusters, providing a visual means of assessing parameters such as the number of clusters. The silhouette index s(i) lies in the interval [−1, 1]. The score ranges from −1 for incorrect clustering to +1 for highly dense, well-separated clustering; scores near zero indicate overlapping clusters.
The idea of silhouette width contrasts within-cluster tightness with isolation from the rest of the data [11, 18]. The average silhouette is used to assess the quality of a clustering: it measures how well each object fits into its own cluster compared with the other clusters. A high average silhouette width indicates a healthy clustering. The silhouette value of a data point is defined in Eq. (3.2) as follows:
$$s\left(i\right)=\frac{b\left(i\right)-a\left(i\right)}{\max\left\{a\left(i\right),\,b\left(i\right)\right\}} \tag{3.2}$$
where a(i) denotes the average distance between the ith point and the other points in the same cluster, and b(i, k) denotes the average distance between the ith point and the points in the kth cluster. b(i) is the smallest of the b(i, k) values over all clusters k to which the ith point does not belong. When s(i) is close to one, the ith point is clearly well placed within its own cluster. The disadvantage is that the Silhouette Coefficient is typically higher for convex clusters than for other cluster shapes.
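To make Eq. (3.2) concrete, the following sketch computes a(i), b(i), and s(i) directly for a single point; the function name and arguments are hypothetical, X and labels are assumed to be NumPy arrays, and the point's own cluster is assumed to contain at least two members.

```python
import numpy as np

def silhouette_of_point(i, X, labels):
    """Compute s(i) from Eq. (3.2) for the ith point.

    X is an (n, d) array of points; labels is an (n,) integer array.
    """
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)

    # a(i): mean distance to the other points in the same cluster.
    same = (labels == own)
    same[i] = False  # exclude the point itself
    a_i = dists[same].mean()

    # b(i): smallest b(i, k) over all clusters k the point does not belong to.
    b_i = min(dists[labels == c].mean()
              for c in np.unique(labels) if c != own)

    return (b_i - a_i) / max(a_i, b_i)
```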
The steps in the silhouette technique are as follows (a sketch of the full sweep follows the list):
- Run the K-means clustering algorithm, varying k (the number of clusters) from 2 to a chosen maximum (the silhouette is undefined for a single cluster).
- Calculate the average silhouette of observations (silavg) for each value of k.
- Plot the silavg curve as a function of the number of clusters k.
- The location of the maximum indicates the proper number of clusters.
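A sketch of the full silhouette sweep, assuming scikit-learn, is shown below; silhouette_score returns the average silhouette width over all points, and the synthetic dataset is again an illustrative placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative synthetic dataset with four clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# The silhouette is undefined for k = 1, so the sweep starts at k = 2.
k_values = range(2, 11)
sil_avg = [silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                      random_state=42).fit_predict(X))
           for k in k_values]

# The k that maximises the average silhouette width is taken as optimal.
best_k = k_values[int(np.argmax(sil_avg))]
print(f"Best k by silhouette: {best_k}")
```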
Gap Statistic Method
Gap statistic approaches can also be used to find the best cluster number in the dataset under consideration. The gap statistic methodology [23, 26] finds the likely optimal cluster number as follows: the right cluster number is determined from the K-means output by comparing the change in intra-cluster dispersion.
The gap statistic is a method for roughly estimating the "correct" number of clusters, k, for an unsupervised clustering. As seen in Fig. 3, this is accomplished by assessing an error measure (the within-cluster sum of squares) as a function of the choice of k.
The algorithm works as follows (a code sketch follows these steps):
1. Cluster the observed data with k = 1, ..., kmax clusters and calculate the corresponding total intra-cluster variation Wk.
2. Generate B reference data sets drawn from a random uniform distribution. Cluster each reference data set with k = 1, ..., kmax clusters and compute the corresponding total intra-cluster variation Wkb.
3. Calculate the estimated gap statistic as the deviation of the observed Wk value from its expected value under the null reference distribution, as shown in Eq. (3.3):
$$Gap\left(k\right)=\frac{1}{B}\sum_{b=1}^{B}\left(\log W_{kb}-\log W_{k}\right) \tag{3.3}$$
4. Calculate the standard deviation sk of the statistic as well.
5. Choose as the number of clusters the smallest value of k such that the gap statistic at k is within one standard deviation of the gap at k + 1, as shown in Eq. (3.4):
$$Gap\left(k\right)\ge Gap\left(k+1\right)-s_{k+1} \tag{3.4}$$
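The following sketch implements steps 1–5 under common assumptions: the reference data sets are drawn uniformly over the bounding box of the observed data, Wk is taken to be the K-means inertia, and the standard deviation is inflated by the factor sqrt(1 + 1/B) used in the original gap statistic formulation. The dataset and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(42)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

def w_k(data, k):
    """Total within-cluster dispersion W_k (K-means inertia)."""
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit(data).inertia_

k_max, B = 10, 20
lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the observed data

gaps, s_k = [], []
for k in range(1, k_max + 1):
    # Step 2: B uniform reference data sets over the bounding box.
    log_wkb = np.array([np.log(w_k(rng.uniform(lo, hi, size=X.shape), k))
                        for _ in range(B)])
    # Step 3, Eq. (3.3): gap between reference and observed log-dispersion.
    gaps.append(log_wkb.mean() - np.log(w_k(X, k)))
    # Step 4: standard deviation, inflated by sqrt(1 + 1/B).
    s_k.append(log_wkb.std() * np.sqrt(1.0 + 1.0 / B))

# Step 5, Eq. (3.4): smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
best_k = next(k for k in range(1, k_max)
              if gaps[k - 1] >= gaps[k] - s_k[k])
print(f"Best k by gap statistic: {best_k}")
```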
A drawback of the gap statistic is that it requires calculating all-pairs distances within each cluster. Its authors demonstrate that sampling is sufficient, but this adds some complexity. Additionally, the gap curve is neither convex nor monotonic, which makes determining the ideal point more challenging. Since it is a "relative" measure, it is of little use for decision-making when there are no notable clusters.
As can be seen, each of the three aforementioned algorithms has a distinct set of characteristics. The Elbow Method has low computational complexity and uses the SSE as its performance metric; the inflection point is found by iterating over the K value. Its flaw is that the inflection point depends on the relationship between K and the distances: if the inflection point is not obvious, the K value cannot be determined.
The Silhouette Coefficient approach performs cluster analysis using cluster cohesion and separation: cohesion is minimised while separation is maximised, the two are combined in s(i), and the method iterates over the K value. When s(i) is maximal, K is the optimal number of clusters. The problem is that the computational complexity is O(n²), since the distance matrix must be computed; when the volume of data approaches one million or even ten million points, the processing overhead becomes substantial. This strategy is therefore not recommended for big data sets.
The Gap Statistic approach compares the expected value under the averaged reference data sets with the observed data set in order to determine the k value at which the dispersion drops the fastest. However, because of its time and space complexity, this method is not suitable for many real-world data sets. Even though there are several validity-index methods for estimating the ideal number of clusters, k, for clustering in WSNs, the values of k obtained from the current methods may occasionally disagree, leading to contradictory results for the same dataset. As a result, an effort is made to propose the Variance Difference Method (VDM), an algorithm that performs clustering by predicting the ideal number of clusters K for the given dataset.