In-depth research [14–17] has been done on the problem of finding the number of clusters k in a dataset [3, 20], and other authors have provided a range of methods. To evaluate the number of clusters produced by clustering algorithms [24, 25], a number of validity indices have been developed in the literature; examples include the Gap index, the Silhouette index, and the Elbow method. In a significant portion of the research [12, 13], several validity indices have been investigated by applying clustering algorithms to various datasets for a range of values of k and selecting the k that yields the best partition.
Elbow Method
Clustering, a classic machine learning approach, is crucial in data analysis. Most clustering algorithms rely on a predetermined number of clusters, although in practice the number of clusters is frequently unknown. In cluster analysis, the elbow method is a heuristic for estimating the number of clusters in a dataset. The Sum of Squared Error (SSE) is widely used as a reference for choosing the optimal number of clusters. The SSE is computed using Eq. (3.1), where d_i is the distance between the ith data point and its cluster centre.
$$SSE=\sum_{i=1}^{n}{d}_{i}^{2} \tag{3.1}$$
The strategy consists of plotting the sum of squared errors as a function of the number of clusters and choosing the elbow of the curve as the number of clusters to use. In the Elbow approach, we vary the number of clusters K, typically from 1 to 10, and compute the Within-Cluster Sum of Squares (WCSS) for each value of K. The WCSS is the sum of the squared distances between each point in a cluster and the cluster centroid. According to the Elbow technique, the total WCSS is plotted as a function of the number of clusters, and the number of clusters should be selected at the point beyond which adding more clusters does not significantly improve the total WCSS. We first determine the WCSS for k = 2 (two clusters), then incrementally raise k by one unit at each iteration (k = k + 1), recomputing the WCSS each time. If the WCSS does not decrease appreciably beyond a certain value of k, then that k is the best choice for the number of clusters.
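As a minimal illustration of Eq. (3.1), the following Python sketch computes the SSE for a toy set of points with hypothetical cluster assignments; the arrays points, centres, and labels are illustrative placeholders, not data from this work.

```python
import numpy as np

# Hypothetical toy data: six 2-D points, two cluster centres, and the
# index of the centre each point is assigned to.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
centres = np.array([[1.0, 1.0], [5.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Eq. (3.1): sum over all points of the squared distance d_i between
# each point and the centre of the cluster it belongs to.
d = np.linalg.norm(points - centres[labels], axis=1)
sse = np.sum(d ** 2)
print(f"SSE = {sse:.4f}")
```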
The plot of the WCSS against the K value resembles an elbow. The WCSS value decreases as more clusters are added and is at its maximum when K = 1. Examining the chart in Fig. 1 closely, we can see that the curve bends sharply at one point, taking the shape of an elbow; from this point onwards, the curve flattens out in the direction of the X-axis. The K value corresponding to this point is the appropriate number of clusters.
The Elbow method estimates the correct number of clusters as follows (a code sketch of the procedure follows the list):
- Apply the K-means algorithm for various values of k. The value of k can range from one up to the total number of instances.
- Calculate the total WCSS for each k.
- Plot the WCSS curve against the number of clusters k.
- The location of a bend in the plot is generally taken as an indicator of the appropriate number of clusters.
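A sketch of the full elbow procedure is given below, assuming the scikit-learn and matplotlib libraries are available; KMeans exposes the WCSS of a fitted model as its inertia_ attribute, and the synthetic dataset from make_blobs is an illustrative stand-in for real data. The bend must still be read off the resulting plot by inspection.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic dataset with four well-separated clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Run K-means for k = 1..10 and record the WCSS (inertia_ in scikit-learn).
k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in k_values]

# Plot WCSS against k; the elbow of the curve suggests the cluster count.
plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```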
The elbow method's drawback is that it evaluates only one global clustering characteristic. The elbow approach is a somewhat subjective strategy that does not work for every dataset; it merely indicates where the optimal value of k might lie. Moreover, the elbow approach is not always effective, especially when the data are not strongly clustered. Even though the elbow technique is one of the most widely used methods for establishing the optimal cluster number, determining the number of clusters depends on manually locating the elbow point on the visualised curve, and even experienced analysts find it difficult to identify the elbow point quickly when the curve is relatively smooth [22].
Silhouette Index Method
The silhouette value measures an object's cohesion with its own cluster compared with other clusters (separation). The silhouette plot, as shown in Fig. 2, illustrates how close the points in one cluster are to points in the neighbouring clusters, providing a visual means of assessing parameters such as the number of clusters. The silhouette index s(i) lies in the interval [−1, 1]. The score ranges from −1 for incorrect clustering to +1 for highly dense, well-separated clustering; scores near zero indicate overlapping clusters.
The idea of silhouette width contrasts within-cluster tightness with isolation from the rest of the data [11, 18]. The average silhouette is used to assess the quality of a clustering: it measures how well each object fits into its own cluster compared with the other clusters. A high average silhouette width indicates a healthy clustering. The silhouette value of a data point is defined in Eq. (3.2) as follows:
$$s\left(i\right)=\frac{b\left(i\right)-a\left(i\right)}{\max\left\{a\left(i\right),\,b\left(i\right)\right\}} \tag{3.2}$$
where a(i) denotes the average distance between the ith point and the other points in the same cluster, and b(i, k) denotes the average distance between the ith point and the points in the kth cluster. b(i) is the smallest of the b(i, k) values over all clusters k to which the ith point does not belong. When s(i) is close to one, the ith point is clearly well placed within its own cluster. The disadvantage is that the Silhouette Coefficient is typically higher for convex clusters than for other cluster shapes.
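To make Eq. (3.2) concrete, the following sketch computes a(i), b(i), and s(i) directly for a single point; the function name and arguments are hypothetical, X and labels are assumed to be NumPy arrays, and the point's own cluster is assumed to contain at least two members.

```python
import numpy as np

def silhouette_of_point(i, X, labels):
    """Compute s(i) from Eq. (3.2) for the ith point.

    X is an (n, d) array of points; labels is an (n,) integer array.
    """
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)

    # a(i): mean distance to the other points in the same cluster.
    same = (labels == own)
    same[i] = False  # exclude the point itself
    a_i = dists[same].mean()

    # b(i): smallest b(i, k) over all clusters k the point does not belong to.
    b_i = min(dists[labels == c].mean()
              for c in np.unique(labels) if c != own)

    return (b_i - a_i) / max(a_i, b_i)
```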
The steps in the silhouette technique are as follows (a sketch of the full sweep follows the list):
- Run the K-means clustering algorithm, varying k (the number of clusters) from 2 to a chosen maximum (the silhouette is undefined for a single cluster).
- Calculate the average silhouette of observations (silavg) for each value of k.
- Plot the silavg curve as a function of the number of clusters k.
- The location of the maximum indicates the proper number of clusters.
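A sketch of the full silhouette sweep, assuming scikit-learn, is shown below; silhouette_score returns the average silhouette width over all points, and the synthetic dataset is again an illustrative placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative synthetic dataset with four clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# The silhouette is undefined for k = 1, so the sweep starts at k = 2.
k_values = range(2, 11)
sil_avg = [silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                      random_state=42).fit_predict(X))
           for k in k_values]

# The k that maximises the average silhouette width is taken as optimal.
best_k = k_values[int(np.argmax(sil_avg))]
print(f"Best k by silhouette: {best_k}")
```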
Gap Statistic Method
Gap statistic approaches can also be used to find the best cluster number in the dataset under consideration. The gap statistic methodology [23, 26] finds the likely optimal cluster number as follows: the right cluster number is determined from the K-means output by comparing the change in intra-cluster dispersion.
The gap statistic is a method for roughly estimating the "correct" number of clusters, k, for an unsupervised clustering. As seen in Fig. 3, this is accomplished by assessing an error measure (the within-cluster sum of squares) as a function of the choice of k.
The algorithm works as follows (a code sketch follows these steps):
1. Cluster the observed data with k = 1, ..., kmax clusters and calculate the corresponding total intra-cluster variation Wk.
2. Generate B reference data sets drawn from a random uniform distribution. Cluster each reference data set with k = 1, ..., kmax clusters and compute the corresponding total intra-cluster variation Wkb.
3. Calculate the estimated gap statistic as the deviation of the observed Wk value from its expected value under the null reference distribution, as shown in Eq. (3.3):
$$Gap\left(k\right)=\frac{1}{B}\sum_{b=1}^{B}\left(\log W_{kb}-\log W_{k}\right) \tag{3.3}$$
4. Calculate the standard deviation sk of the statistic as well.
5. Choose as the number of clusters the smallest value of k such that the gap statistic at k is within one standard deviation of the gap at k + 1, as shown in Eq. (3.4):
$$Gap\left(k\right)\ge Gap\left(k+1\right)-s_{k+1} \tag{3.4}$$
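The following sketch implements steps 1–5 under common assumptions: the reference data sets are drawn uniformly over the bounding box of the observed data, Wk is taken to be the K-means inertia, and the standard deviation is inflated by the factor sqrt(1 + 1/B) used in the original gap statistic formulation. The dataset and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(42)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

def w_k(data, k):
    """Total within-cluster dispersion W_k (K-means inertia)."""
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit(data).inertia_

k_max, B = 10, 20
lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the observed data

gaps, s_k = [], []
for k in range(1, k_max + 1):
    # Step 2: B uniform reference data sets over the bounding box.
    log_wkb = np.array([np.log(w_k(rng.uniform(lo, hi, size=X.shape), k))
                        for _ in range(B)])
    # Step 3, Eq. (3.3): gap between reference and observed log-dispersion.
    gaps.append(log_wkb.mean() - np.log(w_k(X, k)))
    # Step 4: standard deviation, inflated by sqrt(1 + 1/B).
    s_k.append(log_wkb.std() * np.sqrt(1.0 + 1.0 / B))

# Step 5, Eq. (3.4): smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
best_k = next(k for k in range(1, k_max)
              if gaps[k - 1] >= gaps[k] - s_k[k])
print(f"Best k by gap statistic: {best_k}")
```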
A drawback of the gap statistic is that it requires calculating all-pairs distances within each cluster. Its authors demonstrate that sampling is sufficient, but this adds some complexity. Additionally, the gap curve is neither convex nor monotonic, which makes determining the ideal point more challenging. Since it is a "relative" measure, it is of little use for decision-making when there are no notable clusters.
As can be seen, each of the three aforementioned algorithms has a distinct set of characteristics. The Elbow Method has low computational complexity and uses the SSE as its performance metric; the inflection point is found by iterating over the K value. Its flaw is that the inflection point depends on the relationship between K and the distances: if the inflection point is not obvious, the K value cannot be determined.
The Silhouette Coefficient approach performs cluster analysis using cluster cohesion and separation: cohesion is minimised while separation is maximised, the two are combined in s(i), and the method iterates over the K value. When s(i) is maximal, K is the optimal number of clusters. The problem is that the computational complexity is O(n²), since the distance matrix must be computed; when the volume of data approaches one million or even ten million points, the processing overhead becomes substantial. This strategy is therefore not recommended for big data sets.
The Gap Statistic approach compares the expected value under the averaged reference data sets with the observed data set in order to determine the k value at which the dispersion drops the fastest. However, because of its time and space complexity, this method is not suitable for many real-world data sets. Even though there are several validity-index methods for estimating the ideal number of clusters, k, for clustering in WSNs, the values of k obtained from the current methods may occasionally disagree, leading to contradictory results for the same dataset. As a result, an effort is made to propose the Variance Difference Method (VDM), an algorithm that performs clustering by predicting the ideal number of clusters K for the given dataset.