K-means Implementation: A Brief Overview
The K-means algorithm is one of the most popular and widely used clustering techniques, characterized by a straightforward yet effective methodology. Beginning with randomly chosen cluster centres, it iteratively assigns each data point to its nearest centre and then updates each centre as the mean of its assigned points. This process repeats until the assignments stabilize. The algorithm is deliberately simple, relying only on mean calculations and a distance metric; its wide adoption stems from its efficiency and its effectiveness in revealing natural groupings within datasets.
Enhancing K-means for Overlapping Clustering
Algorithmic Flow: Overlapping K-means Clustering:
Step 1: Initialization
- Randomly initialize K cluster centres.
Step 2: Assignment
- For each data point xi, calculate its distance to all cluster centres.
- Assign xi to the cluster with the nearest centre, as in traditional K-means.
Step 3: Overlapping Assignment
- For each data point xj, check whether it also lies within radius R of any other cluster centre.
- If yes, assign xj to those cluster(s) as well, so a single point may belong to multiple clusters.
Step 4: Update Centres
- Recalculate the cluster centres based on the data points assigned in Steps 2 and 3.
Step 5: Convergence Check
- Repeat Steps 2-4 until convergence, i.e., minimal change in cluster assignments or reaching a set number of iterations.
Step 6: Output the Clusters.
This algorithm extends traditional K-means by introducing a radius threshold for overlapping assignments, allowing data points to belong to multiple clusters. The process iterates until the cluster assignments stabilize.
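The steps above can be sketched in Python with NumPy. The function name, the representation of memberships as sets of cluster indices, and the empty-cluster guard are illustrative choices, not a fixed specification:

```python
import numpy as np

def overlapping_kmeans(X, k, radius, n_iters=100, seed=0):
    """Illustrative sketch of K-means with a radius threshold for overlap."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly initialise K cluster centres from the data points.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    members = []
    for _ in range(n_iters):
        # Steps 2-3: distance from every point to every centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # A point joins its nearest cluster plus every cluster whose
        # centre lies within `radius`, so memberships can overlap.
        members = [set(np.flatnonzero(row <= radius)) | {n}
                   for row, n in zip(dists, nearest)]
        # Step 4: recompute each centre as the mean of its members.
        new_centres = centres.copy()
        for c in range(k):
            idx = [i for i, m in enumerate(members) if c in m]
            if idx:  # keep the old centre if a cluster ends up empty
                new_centres[c] = X[idx].mean(axis=0)
        # Step 5: stop once the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, members
```

With a small radius this reduces to ordinary single-cluster memberships; a larger radius lets points near cluster boundaries belong to several clusters at once.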
Below is the flow chart for the implementation of overlapping clustering, extending K-means with a radius threshold:
Overlapping Clustering Implementation
Step 1: Pre-process the Dataset and Perform PCA
Effective overlapping clustering relies on thorough dataset pre-processing. This involves converting categorical variables to numerical formats, addressing missing values through filling or removal strategies, and standardizing variables for unbiased clustering. The data is then subjected to Principal Component Analysis (PCA) for dimensionality reduction and optimal clustering outcomes. This meticulous pre-processing sets the stage for accurate and insightful overlapping clustering using K-means with a radius threshold.
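A minimal sketch of this pre-processing pipeline using pandas and scikit-learn; the function name and the mean-fill imputation strategy are illustrative assumptions, and a real pipeline would choose imputation and encoding per column:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def preprocess_for_clustering(df, n_components=2):
    """Sketch of Step 1: encode, impute, standardise, then reduce with PCA."""
    # Convert categorical variables to numerical formats (one-hot encoding).
    df = pd.get_dummies(df)
    # Address missing values; here a simple mean-fill strategy.
    df = df.fillna(df.mean())
    # Standardise variables so no single feature dominates the distance metric.
    scaled = StandardScaler().fit_transform(df)
    # Reduce dimensionality with PCA before clustering.
    return PCA(n_components=n_components).fit_transform(scaled)
```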
Step 2: Apply K-means on Pre-processed Dataset
The K-means clustering algorithm is then applied to the pre-processed dataset, requiring two key inputs:
- Value of K: the number of clusters to be formed.
- Training set of n points: {x1, x2, x3, ..., xn}
Selecting the Number of Clusters (K): The determination of the optimal value for K is crucial, and the elbow method is a popular technique for this purpose. The goal of the elbow-curve analysis is to find the point beyond which adding more clusters no longer yields a significant reduction in the within-cluster sum of squares (WSS); this "elbow" marks a good trade-off between compact clusters and model complexity.
Mathematically, the equation for WSS is:

WSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \| x_{ij} - c_i \|^2

where k is the total number of clusters, n_i is the number of data points in the i-th cluster, x_{ij} denotes the j-th data point in the i-th cluster, and c_i is the centroid of the i-th cluster.
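This formula can be transcribed directly in NumPy; the function name is illustrative, and `labels[i]` is assumed to hold the cluster index of point `i`:

```python
import numpy as np

def wss(X, labels, centres):
    """Within-cluster sum of squares:
    WSS = sum over clusters i, points j of ||x_ij - c_i||^2."""
    return sum(
        np.sum((X[labels == i] - c) ** 2)  # squared distances to centroid i
        for i, c in enumerate(centres)
    )
```

Evaluating this for a range of K values and plotting the results produces the elbow curve described above.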
Initialization: Randomly initialize K points called cluster centroids. Centroids are randomly placed within the data range.
Cluster assignment: Compute the distance between each data point and the cluster centroids, and assign each point to the group with the minimum distance, dividing the data into K groups. The Euclidean distance between two points in the plane is given by:

d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}

where p = (p1, p2) and q = (q1, q2) are the coordinates of a cluster centroid and a data point in the dataset.
Move Centroids: Find the mean of all the data points in a cluster, then relocate the cluster centroid to that mean.
Step 3: Repeat the cluster-assignment and move-centroids steps until the centroids stabilize.
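A single assignment-and-update iteration of plain K-means on a toy dataset (the coordinates are hypothetical, chosen only to make the arithmetic easy to follow):

```python
import numpy as np

# Illustrative data: two loose groups and two starting centroids.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[1.0, 1.0], [9.0, 1.0]])

# Cluster assignment: Euclidean distance from every point to every centroid,
# then each point takes the index of its nearest centroid.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)  # -> [0, 0, 1, 1]

# Move centroids: relocate each centroid to the mean of its cluster.
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
# centroids are now [[0., 1.], [10., 1.]]
```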
Step 4: Cluster overlapping based on radius threshold:
This step is where the overlapping clustering methodology comes into play. Instead of strictly assigning each data point to a single cluster, the code checks the distance of each data point to all cluster centres. If a data point lies within a certain distance (defined by the radius threshold) of a cluster centre, it is considered part of that cluster. This enables overlapping membership, meaning one data point can belong to multiple clusters. The step involves nested loops: for each data point, it iterates through all cluster centres to calculate distances.
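The nested-loop check described above can be sketched as follows; the function name and the set-of-indices representation are illustrative:

```python
import numpy as np

def radius_memberships(X, centres, radius):
    """Each point joins every cluster whose centre lies within `radius`."""
    memberships = []
    for x in X:                                  # outer loop: data points
        clusters = set()
        for c, centre in enumerate(centres):     # inner loop: cluster centres
            if np.linalg.norm(x - centre) <= radius:
                clusters.add(c)                  # within threshold: member
        memberships.append(clusters)
    return memberships
```

A point roughly midway between two centres ends up in both clusters, which is exactly the overlap behaviour this step introduces.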
Step 5: Cluster Visualization, statistics estimation and result tabulation:
Post-clustering, a scatter plot visually represents the clustered data, using distinct colours to denote cluster affiliations, offering an immediate grasp of cluster distribution and potential overlaps. The subsequent step involves computing fundamental statistics for every attribute within each cluster, comprising data point count, mean, standard deviation, and minimum and maximum values, yielding insights into cluster characteristics and variations. Concluding this process, a CSV file is generated, systematically capturing attributes and cluster-assigned details for each data point, facilitating effective documentation of clustering outcomes.
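A condensed sketch of this step with pandas and matplotlib; the function name, column layout (numeric features plus a `cluster` column), and output filename are assumptions for illustration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

def summarise_clusters(df, out_csv="clusters.csv"):
    """Sketch of Step 5: scatter plot, per-cluster statistics, CSV export."""
    features = df.columns.drop("cluster")
    # Scatter plot of the first two features, coloured by cluster label.
    plt.scatter(df[features[0]], df[features[1]], c=df["cluster"])
    plt.xlabel(features[0])
    plt.ylabel(features[1])
    # Per-cluster count, mean, std, min and max for every attribute.
    stats = df.groupby("cluster")[features].agg(
        ["count", "mean", "std", "min", "max"])
    # Persist point-level attributes and cluster assignments.
    df.to_csv(out_csv, index=False)
    return stats
```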
Step 6: Interpreting the visualizations generated from clustering analysis:
Interpreting these visualizations provides valuable insights into the characteristics of, and relationships between, the clusters. The scatter plot helps assess cluster separation and identifies outliers or points far from cluster centres. Feature-distribution box plots highlight significant feature variations across clusters, aiding in distinguishing them; these plots also expose outliers, revealing unusual patterns. The cluster statistics provide key attribute summaries, offering insights into typical values and variability within each cluster.
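A feature-distribution box plot of the kind described can be produced per cluster as follows; the data here are hypothetical:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Illustrative data: one feature measured across two clusters.
df = pd.DataFrame({"feature": [1, 2, 3, 10, 11, 12],
                   "cluster": [0, 0, 0, 1, 1, 1]})

# One box per cluster: well-separated boxes indicate the feature
# discriminates between clusters; points beyond the whiskers are outliers.
groups = [g["feature"].values for _, g in df.groupby("cluster")]
plt.boxplot(groups)
plt.xticks([1, 2], ["cluster 0", "cluster 1"])
plt.ylabel("feature")
```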