A Dynamic K-means Based Clustering Algorithm Using Fuzzy Logic for CH Selection and Machine Learning Based Data Transmission

Clustering is an effective method to increase the network lifetime, energy efficiency, and connectivity of sensor nodes in a wireless sensor network. An energy efficient clustering algorithm is proposed in this paper. Sensor nodes are clustered using the K-means algorithm, which dynamically forms a number of clusters in accordance with the number of alive nodes. A suitable cluster head (CH) is selected by a fuzzy inference system using three fuzzy input variables: the residual energy of the sensor node, its distance from the cluster center, and its distance from the base station. The amount of data transmitted by member nodes to the CH is reduced by machine learning that classifies similar data at regular intervals. The simulation results show that the proposed algorithm outperforms other cluster based algorithms in terms of data received by the base station, number of alive nodes per round, and time of the first node, middle node and last node to die, for various densities of sensor nodes and scalable conditions.

A hierarchical cluster based routing algorithm partitions sensor nodes (SNs) into a number of clusters. Each cluster has a cluster head (CH) and a number of member nodes. Member nodes transmit sensed data to their CH. After receiving data from all member nodes, the CH performs data aggregation and fusion to reduce the amount of data. The CH then transmits the data to the base station (BS). Fig. 1 shows cluster based routing of data towards the BS. The proposed approach of developing the energy efficient clustering protocol is driven by the following questions:
• What is a good clustering?
• How to find the optimal number of clusters to be formed?
• What factors should be considered for selection of the CH?
• How to reduce data transmission by member nodes to the CH?
For good clustering, intra-cluster similarity should be high and inter-cluster similarity should be low. The number of clusters to be formed should be in accordance with the number of alive nodes. Factors that should be considered for determination of the CH are the location of the SN relative to the BS and the center of the cluster, its residual energy, confidence and trust. To reduce the overhead of message transmission from member nodes to the CH, similar patterns in sensed data must be identified and only one copy should be transmitted for every set of similar data.
In this paper, the authors propose a new protocol for wireless sensor networks (WSN) that uses a dynamic K-means algorithm for efficient clustering and formation of an optimal number of clusters. A suitable CH for each cluster is selected by a fuzzy inference system (FIS) that considers the residual energy, distance from the center of the cluster and distance from the BS of each SN. Machine learning is used by member nodes to find similar readings in sensed data. All distinct readings, and one reading for every set of similar readings, are forwarded to the CH from member nodes. This reduces data transmission from member nodes to the CH. Simulation results show that the proposed protocol significantly improves the network lifetime of the WSN.
The rest of the paper is organized as follows: Section 2 presents the related work in the field of cluster based protocols. Section 3 describes the energy model adopted. Section 4 defines the methodology used for construction of the proposed routing algorithm. Section 5 presents the simulations performed in MATLAB and their results. Finally, Section 6 provides the conclusion and scope of future work.

Literature Review
In this section, most of the well-known routing protocols are discussed. Heinzelman et al. [14] proposed the Low-energy adaptive clustering hierarchy (LEACH) protocol, which introduced the concept of clusters in WSN. It is based on a probabilistic model in which each node has equal probability to become CH. The process of routing data is simple and does not require much information. The major disadvantages of LEACH are: (i) the residual energy of the SN is not considered in choosing the CH; (ii) the clusters formed are not uniform. Heinzelman et al. [15] described the LEACH-C protocol, which uses a centralized control technique based on location information of the SNs. The base station forms clusters on the basis of the SNs' current location and energy level, resulting in more balanced clusters than those formed by the LEACH algorithm. Lindsey and Raghavendra [16] proposed Power-Efficient Gathering in Sensor Information Systems (PEGASIS). It uses a greedy algorithm to organize SNs in the form of a chain. Each node receives data from and transfers data to its close neighbor. Fan and Song [17] presented the Multi-hop LEACH (M-LEACH) protocol for multi-hop communication, which takes scalability into consideration. Its drawback is that it cannot be implemented in a heterogeneous sensor network. Beiranvand et al. [18] developed Improved LEACH (I-LEACH), which selects the CH based on minimum distance from the BS, larger remaining energy and a larger number of neighbors. For cluster formation and data transmission, this algorithm considers the distance of the SN from the CH as well as from the BS. If the BS is nearer to the SN than the CH, the SN sends data directly to the BS instead of the CH. Yassein et al. [19] described the Vice-LEACH (VLEACH) algorithm, which uses a vice-CH in addition to the CH and member nodes. The vice-CH takes over the responsibility of the CH when the CH dies. The major flaw of this algorithm is that it provides no solution for the case in which the vice-CH also dies.
Rabiaa et al. [20] proposed an algorithm that uses K-means clustering with the Davies-Bouldin index, which is the ratio of within-cluster to between-cluster distances. For optimal clustering, the value of the Davies-Bouldin index must be as low as possible. The Gaussian elimination algorithm is then used to select the CH. Jerbi et al. [21] developed Orphan-LEACH (O-LEACH), which aims to cover SNs that do not belong to any cluster because they are deployed far away. A cluster member performs the role of a gateway or CH that allows orphan nodes to join. Rajput and Kumaravelu [7] used fuzzy c-means clustering for cluster formation. Selection of the CH is based on the level of centrality of a node in the cluster. Fuzzy c-means cannot be used when the number of clusters to be formed is not fixed. Kim et al. [22] developed CHEF, which uses a fuzzy based approach to select the CH. It is based on two parameters, proximity distance and energy. The locally optimal node with high energy is elected as CH. Some further protocols based on fuzzy techniques are described in [23][24][25].
Machine learning (ML) techniques are very useful in WSN to reduce the amount of data transmitted among SNs. Nodes learn from their surroundings and, based on the knowledge learned, transfer data to other nodes [26]. In supervised learning, known inputs and their outputs are provided for learning. This knowledge is used to predict the result for unseen inputs. In unsupervised learning, similarity in the input data is used to classify the inputs into different classes.
The review of literature shows that different algorithms use individual parameters such as location of SNs, inter-cluster distance, residual energy, distance from BS and number of neighbours for CH selection, but integrated approaches have not been presented. The existing strategies also suffer from significant overhead in the data transmitted from member nodes to the CH.
To overcome the above issues, an energy efficient dynamic K-means based clustering protocol for WSN has been proposed. The prime objective of this research is to increase network lifetime through selection of a suitable CH by a fuzzy inference system and reduction of data transmission from member nodes to the CH by machine learning.

Radio energy model
The radio energy model is used to compute the energy dissipated during data transmission between transmitter and receiver. In WSN, data transmission consumes more energy than data processing [7].
SNs wirelessly transmit their data over a short range. The free space propagation model and the multipath fading channel model [7,27], shown in Fig. 2, have been used in the proposed work.

Fig. 2. Radio communication model
The energy dissipation of an SN is calculated for the following operations:
(i) Transmission of data from cluster member to CH
The energy dissipated in transmitting data from a cluster member to the CH is given by eq. (1). Here, $E_{TX}(k, d)$ is the total energy required to transmit k bits of data from a cluster member to the CH, $E_{elec}$ is the energy consumption of the electronic circuit of the SN, $\varepsilon_{fs}$ and $\varepsilon_{mp}$ are the energies required by the amplifier at the transmitter end for the free space propagation and multipath fading channel models respectively, $d$ is the distance between the cluster member and the CH, and $d_0$ is the reference distance calculated by eq. (2).
(ii) Reception of data by CH
The data transmitted by all cluster members is received at the CH's receiver circuit. The total energy $E_{RX}(k)$ required to receive k bits of data from a cluster member node is computed by eq. (3).
(iii) Transmission of data from CH to BS
The CH aggregates the data received from all its cluster members and transmits it to the BS. The amount of energy needed by a CH for aggregation and transmission is given by eq. (4). Here, $E_{CH}(k, d_{BS})$ is the total energy required to transmit k bits of data from the CH to the BS, $E_{DA}$ is the energy required for data aggregation, and $d_{BS}$ is the distance between the CH and the BS.
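As an illustration of the quantities described above, the following minimal Python sketch implements the first-order radio model in the form it commonly takes in the WSN literature: transmission cost grows with the square of the distance below a reference distance and with the fourth power above it. The exact form of the paper's eqs. (1) to (4) is an assumption here, and the constant values and function names (`tx_energy`, `rx_energy`, `ch_energy`) are hypothetical, chosen only for illustration.

```python
import math

# Assumed illustrative constants (typical literature values, NOT taken from this paper).
E_ELEC = 50e-9       # J/bit, energy consumed by the electronic circuit
EPS_FS = 10e-12      # J/bit/m^2, amplifier energy, free space model
EPS_MP = 0.0013e-12  # J/bit/m^4, amplifier energy, multipath fading model
E_DA = 5e-9          # J/bit, energy for data aggregation

# Reference distance at which the two amplifier models give the same cost.
D0 = math.sqrt(EPS_FS / EPS_MP)


def tx_energy(k_bits, d):
    """Energy to transmit k_bits over distance d (assumed standard form)."""
    if d < D0:
        return k_bits * E_ELEC + k_bits * EPS_FS * d ** 2
    return k_bits * E_ELEC + k_bits * EPS_MP * d ** 4


def rx_energy(k_bits):
    """Energy to receive k_bits at the receiver circuit."""
    return k_bits * E_ELEC


def ch_energy(k_bits, d_bs):
    """Energy for a CH to aggregate k_bits and transmit them to the BS."""
    return k_bits * E_DA + tx_energy(k_bits, d_bs)
```

With these assumed constants the reference distance is roughly 88 m, so most intra-cluster links in a 100 m × 100 m field would fall under the free space branch.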

K-means Clustering Algorithm
The K-means algorithm partitions a set of n objects into k clusters based on the similarity of the objects [20]. It starts by randomly choosing k objects, each of which initially represents a cluster mean or center. Each of the remaining objects is then assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster.
This process repeats until the square-error criterion function given by eq. (7) converges:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (7)$$
Here E is the sum of the square error for all objects in the data set, p represents an object and $m_i$ is the mean of cluster $C_i$. The algorithm determines the k partitions that minimize the criterion function, which makes the resulting k clusters as compact and as distinct as possible.
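The iteration described above can be sketched in a few lines of Python. This is a minimal illustration of plain K-means on 2-D points, not the authors' MATLAB implementation; the function names `kmeans` and `square_error`, the fixed random seed and the iteration cap are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of (x, y) tuples; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k randomly chosen objects as initial means
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the nearest cluster mean (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points
        ]
        if new_assignment == assignment:  # clusters unchanged: converged
            break
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


def square_error(points, centers, assignment):
    """Sum of squared distances of every object to the mean of its cluster."""
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))
```

For two well separated groups of nodes, `kmeans(points, 2)` recovers the groups and `square_error` reports how compact the resulting clusters are.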

Fuzzy logic model
The fuzzy logic model shown in Fig. 4 takes three fuzzy input variables. The first is the residual energy of the SN, whose membership function is shown in Fig. 4(a). Fig. 4(b) shows the membership function for the fuzzy set distance from cluster center (DCC). Near, medium and far are chosen as linguistic values for this fuzzy set. The third fuzzy input variable is distance from BS (DBS). Its membership function is shown in Fig. 4(c). Near, medium and far are considered as the range of values for this fuzzy set.
Fig. 4(a). Membership function for Residual Energy
Fig. 4(b). Membership function plot for Distance from Cluster Center
Fig. 4(c). Membership function plot for Distance from Base Station
2. Rule base and inference engine
The fuzzy inference system considered uses 45 rules, some of which are shown in Table 1. Strong and very strong are among the linguistic values of the output fuzzy set. The chance of a node to become CH is calculated from the input parameters, namely residual energy, distance from center of cluster and distance from BS, by using the fuzzy rules.
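To make the inference step concrete, the sketch below computes a CH chance from the three inputs using triangular membership functions and a weighted average of crisp rule outputs (a zero-order Sugeno-style simplification, swapped in here for brevity). Everything numeric is an assumption: the inputs are taken as normalized to [0, 1], the membership breakpoints are invented, and the seven rules are an illustrative subset, not the paper's 45-rule base or its Table 1.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Assumed fuzzy sets on a normalized [0, 1] axis.
# For the distance inputs, LOW plays the role of "near" and HIGH of "far".
LOW = (-0.5, 0.0, 0.5)
MED = (0.0, 0.5, 1.0)
HIGH = (0.5, 1.0, 1.5)

# Hypothetical rules: (energy set, DCC set, DBS set, crisp chance level).
RULES = [
    (HIGH, LOW, LOW, 1.0),   # e.g. "very strong"
    (HIGH, LOW, MED, 0.8),   # e.g. "strong"
    (HIGH, MED, MED, 0.6),
    (MED, LOW, LOW, 0.6),
    (MED, MED, MED, 0.4),
    (LOW, MED, MED, 0.2),
    (LOW, HIGH, HIGH, 0.0),
]


def ch_chance(energy, dcc, dbs):
    """Weighted average of rule outputs; firing strength is the min (fuzzy AND)."""
    num = den = 0.0
    for e_set, c_set, b_set, level in RULES:
        w = min(tri(energy, *e_set), tri(dcc, *c_set), tri(dbs, *b_set))
        num += w * level
        den += w
    return num / den if den > 0 else 0.0
```

Within a cluster, the node with the largest `ch_chance` value would be the natural CH candidate. A Mamdani-style system with a fuzzy output set and centroid defuzzification would follow the same structure with a different aggregation step.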

Machine learning model
The Intel lab data set has been considered for classification with machine learning (ML) using Python. A sample set of sensor readings for a certain period has been taken for training and classification. Three attributes, namely time, mote id and humidity, have been selected from the data set. A new variable "similarity" has been appended to the data set. It contains the value "similar" or "dissimilar". The classifier has been evaluated using the following factors:
Precision factor: It is the positive predictive value, which indicates how good a model is at predicting the positive class. It is calculated as the ratio of True Positive to Predicted Yes.
Recall factor: It gives a measure of how correctly the model is able to identify the relevant data. It is the ratio of True Positive to Actual Yes.
An analysis of humidity data from the considered data subset for a certain time period is presented in Fig. 6. Count shows the number of occurrences of a humidity value. One reading is transmitted for every set of similar readings. This approach significantly reduces the number of readings transmitted by member nodes to the CH, which saves energy at the member nodes and increases network lifetime.
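The effect of sending one reading per set of similar readings can be sketched as follows. The paper labels readings as similar or dissimilar with a trained classifier; this example swaps in a simple fixed tolerance as a stand-in for that classifier, so the function name `suppress_similar` and the tolerance value are assumptions for illustration only.

```python
def suppress_similar(readings, tolerance=0.5):
    """Return the readings a member node would forward to its CH.

    A reading is forwarded only if it differs by more than `tolerance`
    from every reading already selected, so each set of similar
    readings is represented by a single value.
    """
    representatives = []
    for value in readings:
        if all(abs(value - rep) > tolerance for rep in representatives):
            representatives.append(value)
    return representatives


humidity = [40.1, 40.2, 40.3, 45.0, 45.1, 40.0]
to_send = suppress_similar(humidity)  # two of the six readings are forwarded
```

In this toy sample the member node transmits two values instead of six, which is the kind of reduction that lowers its transmission energy per round.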

Fig. 6. Analysis of humidity data in the data subset
Table 2 shows the parameters of the Random Forest classifier that has been evaluated using Python.
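The precision and recall factors defined earlier in this section can be computed directly from actual and predicted labels. The sketch below is a plain Python illustration, assuming "similar" is treated as the positive class; the function name `precision_recall` is hypothetical and the label lists are made-up examples, not results from the paper.

```python
def precision_recall(actual, predicted, positive="similar"):
    """Precision = TP / Predicted Yes, Recall = TP / Actual Yes."""
    true_positive = sum(
        1 for a, p in zip(actual, predicted) if a == positive and p == positive
    )
    predicted_yes = sum(1 for p in predicted if p == positive)
    actual_yes = sum(1 for a in actual if a == positive)
    precision = true_positive / predicted_yes if predicted_yes else 0.0
    recall = true_positive / actual_yes if actual_yes else 0.0
    return precision, recall


# Toy labels for illustration only.
actual = ["similar", "similar", "similar", "dissimilar"]
predicted = ["similar", "similar", "dissimilar", "dissimilar"]
precision, recall = precision_recall(actual, predicted)
```

Here every predicted "similar" is correct, so precision is 1, while one actual "similar" reading is missed, so recall is 2/3.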

Proposed algorithm (DKFM)
A Dynamic K-means based clustering algorithm using fuzzy logic for CH selection and machine learning based data transmission (DKFM) has been proposed on the basis of the outcomes of the literature review. This protocol uses a dynamic K-means algorithm to form an optimal number of clusters and to reduce intra-cluster distance. A fuzzy inference system selects a suitable CH by considering three fuzzy input variables: (i) residual energy of the SN, (ii) distance of the SN from the cluster center, and (iii) distance of the SN from the base station. The amount of data transmitted by member nodes to the CH is reduced by machine learning that classifies similar data at regular intervals. The following assumptions have been made for the proposed protocol:
• The network is homogeneous.
• All SNs and the BS are stationary.
• Each node knows its residual energy and current position.
• All nodes are able to send data to the BS.
The procedure of the proposed routing protocol (DKFM) is as follows:
Step 1. Clustering using dynamic K-means
The K-means algorithm is executed on the target WSN having n nodes. The number of clusters to be formed (K) is selected dynamically for each round by eq. (10):
$$K = \sqrt{\text{initial nodes} - \text{dead nodes}} \qquad (10)$$
First, k out of the n nodes are randomly selected as the initial centers. Each of the remaining nodes joins the cluster center nearest to it according to the Euclidean distance. After each node in the network has been assigned to one of the k clusters, the center of each cluster is recalculated by eq. (11) and all nodes are reassigned using the updated cluster centers:
$$\text{Center}(x, y) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i,\ \frac{1}{n} \sum_{i=1}^{n} y_i \right) \qquad (11)$$
where the sums run over the n nodes currently assigned to the cluster. This process is repeated until the clusters formed in the current iteration are identical to those formed in the previous iteration.
Step 2. Fuzzy based selection of CH
After the formation of clusters, the FIS described in section 4.2 is used to select the CH for each cluster. The node selected as CH then broadcasts its status to all other nodes in the cluster.
Step 4. Schedule creation
The selected CHs create a TDMA schedule that defines the time slot in which each member of the cluster forwards data to its CH.
Step 5. Machine learning based data transmission
All cluster members send data to their CH in their allocated time slot using machine learning (described in section 4.3). The CH aggregates the data received from all member nodes and sends it to the BS.
Step 6. Count dead nodes and alive nodes. If (alive nodes > 0), start a new round.
The above procedure is represented by a flow chart (Fig. 7) and Algorithm 1.
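The per-round bookkeeping in the procedure above (choosing K from the number of alive nodes, computing a cluster center, and handing out TDMA slots) can be sketched as below. This is a hedged illustration rather than the authors' Algorithm 1: the paper does not state how the square root is turned into an integer, so rounding to the nearest integer with a minimum of one cluster is an assumption, and the function names are hypothetical.

```python
import math


def dynamic_k(initial_nodes, dead_nodes):
    """Number of clusters for the current round: sqrt(initial - dead).

    Rounding to the nearest integer (and at least 1 while any node is
    alive) is an assumption; the source gives only the square root.
    """
    alive = initial_nodes - dead_nodes
    if alive <= 0:
        return 0
    return max(1, round(math.sqrt(alive)))


def cluster_center(positions):
    """Mean x and mean y of the nodes currently assigned to one cluster."""
    n = len(positions)
    return (sum(x for x, _ in positions) / n, sum(y for _, y in positions) / n)


def tdma_schedule(member_ids):
    """Assign each cluster member its own transmission slot index."""
    return {node_id: slot for slot, node_id in enumerate(member_ids)}
```

For example, with 150 initial nodes and 50 dead nodes, `dynamic_k` gives 10 clusters for that round, so the cluster count shrinks as the network ages.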

Simulation results and discussion
The proposed protocol (DKFM) is compared with LEACH [14] and I-LEACH [18] in terms of network lifetime, number of alive nodes per round, data received by the base station, and time of the first node, middle node and last node to die. MATLAB R2016a is used to implement LEACH, I-LEACH and the proposed protocol. The network simulation parameters and their values that have been considered are listed in the simulation parameters table [12,28].

Table 8. Network lifetime on the basis of NDF, NHD, NDL for sensing area = 100 m × 100 m, number of SNs = 150
Table 11. Network lifetime on the basis of NDF, NHD, NDL for sensing area = 200 m × 250 m, number of SNs = 150
The proposed protocol shows considerable improvement in network lifetime over the two conventional protocols in all scenarios (Fig. 10 (a-d)).
Fig. 12 (a-c). Effect of increase in node density on average time of NDF, NHD and NDL for simulated protocol
Fig. 13 (a-c). Effect of increase in size of sensing area on average time of NDF, NHD and NDL for simulated protocol
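The three lifetime metrics used in these tables and figures can be derived from the round in which each node dies. The sketch below reads NDF, NHD and NDL as the rounds at which the first node, the middle (half) node and the last node die, which is an interpretation of the acronyms based on the metrics named in the text; the function name and the sample death rounds are invented for illustration and are not simulation results.

```python
import math


def lifetime_metrics(death_rounds):
    """Return (NDF, NHD, NDL) from the round in which each node died."""
    ordered = sorted(death_rounds)
    n = len(ordered)
    ndf = ordered[0]                      # first node to die
    nhd = ordered[math.ceil(n / 2) - 1]   # node whose death leaves half the nodes dead
    ndl = ordered[-1]                     # last node to die
    return ndf, nhd, ndl


# Made-up death rounds for five nodes.
ndf, nhd, ndl = lifetime_metrics([450, 120, 610, 300, 500])
```

Comparing these three values across protocols for the same deployment gives the kind of lifetime comparison reported in the tables.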

Conclusion
In this work, a Dynamic K-means based clustering algorithm using fuzzy logic for CH selection and machine learning based data transmission (DKFM) for wireless sensor networks has been proposed.
It forms the optimum number of clusters using dynamic K-means clustering such that the intra-cluster data transmission distance of the SNs is reduced. A fuzzy inference system has been used to select a suitable CH considering three fuzzy input variables: the residual energy of the SN, its distance from the cluster center and its distance from the base station. The amount of data transmitted by member nodes to the CH has been reduced by machine learning that classifies similar data at regular intervals. In future, the performance of the proposed algorithm will be compared using other network simulators. Further, it can be extended to heterogeneous networks with mobile SNs and BS to gain more flexibility in real time applications.