Implement overlapping clustering using K means where the points can belong to zero, one or more clusters

doi:10.21203/rs.3.rs-3958267/v1

Download PDF

Research Article

Implement overlapping clustering using K means where the points can belong to zero, one or more clusters

https://doi.org/10.21203/rs.3.rs-3958267/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this latest preprint version

Clustering is a fundamental technique in data analysis and pattern recognition, aiming to group similar data points into distinct clusters. Traditional clustering algorithms assign each data point to a single cluster, assuming that the underlying structure of the data is non-overlapping. However, real-world datasets often exhibit complex relationships, where data points may belong to multiple clusters simultaneously. This work explores the concept of overlapped clustering, wherein data points are allowed to participate in more than one cluster, accommodating the inherent flexibility and diversity present in various datasets.

In this work, we modify k-means to handle such situations where a point ‘A’ can belong to zero, one or more than one group. We propose a new way to use k-means that lets things be part of different groups. This helps show complex relationships in data. By leveraging the well-established foundation of the k-means algorithm, this work contributes a pragmatic solution to the challenge of overlapped clustering.

We test our modified method on different data, showing it can find overlapping groups better than regular k-means. This gives a simpler way to tackle overlapping clusters in data analysis, making it easier to uncover hidden patterns in today's complex datasets.

Overlapping clustering

radius threshold

K-means algorithm

flexible data clustering

data pattern recognition

real-world dataset analysis

Clustering, a fundamental technique in data analysis and machine learning, plays a pivotal role in unveiling hidden structures within datasets. This work embarks on the enhancement of the widely adopted K-means clustering algorithm, a cornerstone in the field, to address the nuanced challenges posed by real-world datasets. The inherent complexity and diversity of data often necessitate a more flexible approach, prompting

the exploration of overlapping clustering. Unlike exclusive clustering, where each data point rigidly belongs to a single cluster, overlapping clustering allows for the simultaneous membership of data points in multiple clusters, acknowledging the intricate relationships present in complex datasets.

The versatility of clustering techniques, from exclusive to overlapping and hierarchical approaches, empowers analysts to discern patterns, relationships, and structures. Among these, K-means clustering stands out for its simplicity, efficiency, and effectiveness in identifying distinct, well-separated clusters. However, its conventional application faces limitations when dealing with datasets exhibiting overlapping characteristics. This work seeks to extend the utility of the K-means algorithm by introducing a radius threshold, enabling it to accommodate overlapping clusters more adeptly. This modification captures subtle relationships in diverse datasets. With a focus on overlapping clustering, our approach offers a promising avenue for uncovering intricate patterns.

Sun et al. (2023) lay a robust foundation within clustering algorithms, conducting a comprehensive analysis of K-means and comparing it with K-DBSCAN, offering insights into its strengths and limitations. Chi (2021) extends its application to student achievement analysis, showcasing its versatility. Setiawan et al. (2023) contribute insights by exploring hospital clustering using K-means and Fuzzy C-means, highlighting computational expenses and sensitivity to initialization. Wu et al. (2021) propose an improved K-means algorithm based on density normalization, addressing challenges in

handling diverse datasets. Shen and Duan's (2020) application in teaching satisfaction data provides domain-specific insights. The sequence, ordered by relevance, leads to the proposed novel methodology, placing emphasis on Wu et al.'s (2021) innovative density normalization approach. This literature review not only comprehensively explores existing knowledge but also establishes a strong foundation for the proposed method, incorporating a radius threshold for overlapping clustering within K-means.

In summary, our approach to overlapping clustering with K-means and a radius threshold offers a clear and efficient alternative to fuzzy clustering methods. It provides unambiguous assignments without the computational complexities often associated with fuzzy clustering, especially in handling large datasets. The simplicity and interpretability of our approach make it a practical choice, overcoming challenges typically associated with fuzzy clustering methods.

K-means Implementation: A Brief Overview

The K-means algorithm is the most popular and widely used clustering technique characterized by a straightforward yet effective methodology. Beginning with random cluster centres, it iteratively assigns data points to the nearest cluster and updates the cluster centre based on assigned points. This process repeats until stability. The algorithm strives for simplicity, relying on mean calculations and distance metrics. Despite its straightforwardness, its wide application stems from its efficiency and effectiveness in revealing natural groupings within datasets.

Enhancing K-means for Overlapping Clustering

Algorithmic Flow: Overlapping K-means Clustering:

Step 1: Initialization

Randomly initialize K cluster centres.

Step 2: Assignment

For each data point xi, calculate its distance to all cluster centres.
Assign xi to the nearest cluster if the distance is within the traditional K-means radius.

Step 3: Overlapping Assignment

For each data point xj not assigned in Step 2, check if it is within radius R of any cluster centre.
If yes, assign xj to the respective cluster(s).

Step 4: Update Centres

Recalculate the cluster centres based on the data points assigned in Steps 2 and 3.

Step 5: Convergence Check

Repeat Steps 2-4 until convergence, i.e., minimal change in cluster assignments or reaching a set number of iterations.

Step 6: Output the Clusters.

This algorithm extends the traditional K-means by introducing a radius threshold for overlapping assignments, allowing data points to belong to multiple clusters. The process iterates until convergence, ensuring optimal cluster assignments.

Below is the flow-chart for the implementation of Overlapping clustering by extending K means by introducing radius threshold:

Overlapping Clustering Implementation

Step 1: Pre-process the Dataset and Perform PCA

Effective overlapping clustering relies on thorough dataset pre-processing. This involves converting categorical variables to numerical formats, addressing missing values through filling or removal strategies, and standardizing variables for unbiased clustering. The data is then subjected to Principal Component Analysis (PCA) for dimensionality reduction and optimal clustering outcomes. This meticulous pre-processing sets the stage for accurate and insightful overlapping clustering using K-means with a radius threshold.

Step 2: Apply K-means on Pre-processed Dataset

The K-means clustering algorithm is then applied to the pre-processed dataset, requiring two key inputs:

Value of K: Which is the number of clusters to be formed. Training set (n): {x1, x2, x3, ……xn}

Selecting the Number of Clusters (K): The determination of the optimal value for K is crucial. The elbow method is a popular technique used for this purpose. The goal during the elbow curve analysis is to find the point where increasing the number of clusters leads to a significant reduction in WSS, indicating a more compact clustering solution.

Mathematically, the equation for WSS is:

Where: k is total number of clusters, ni is the number of data points in the i-th cluster, xij

denotes the j-th data point in the i-th cluster and ci is the centroid of the i-th cluster.

Initialization: Randomly initialize K points called cluster centroids. Centroids are randomly placed within the data range.

Cluster assignment: Compute the distance between data points and clustercentroids. Based on the minimum distance, divide data points into K groups. Euclidean Distance between two points in space is given by the formula:

Where, p = (p1, p2) and q = (q1, q2) are co-ordinates of cluster centroid anddata points in dataset.

Move Centroids: Find the mean of all the data points of a cluster then relocatethe cluster centroid to that mean.

Step 4: Cluster overlapping based on radius threshold:

This step is where the overlapping clustering methodology comes into play. Instead of strictly assigning each data point to a single cluster, the code checks the distance of each data point to all cluster centres. If a data point is within a certain distance (defined by radius threshold) of a cluster centre, it's considered part of that cluster. This enables overlapping membership, meaning one data point can belong to multiple clusters. This step involves nested loops. For each data point, it iterates through all cluster centres to calculate distances.

Step 5: Cluster Visualization, statistics estimation and result tabulation:

Post-clustering, a scatter plot visually represents the clustered data, using distinct colours to denote cluster affiliations, offering an immediate grasp of cluster distribution and potential overlaps. The subsequent step involves computing fundamental statistics for every attribute within each cluster, comprising data point count, mean, standard deviation, and minimum and maximum values, yielding insights into cluster characteristics and variations. Concluding this process, a CSV file is generated, systematically capturing attributes and cluster-assigned details for each data point, facilitating effective documentation of clustering outcomes.

Step 6: Interpreting the visualizations generated from clustering analysis:

This step involves interpreting the visualizations generated from clustering analysis, providing valuable insights into the characteristics and relationships between the clusters. The scatter plot helps assess cluster separation and identifies outliers or points far from cluster centres. Feature Distribution Box Plots highlight significant feature variations across clusters, aiding in distinguishing them. These plots also detect outliers, revealing unique patterns. Cluster Statistics provide key attribute statistics, offering insights into typical values and variability within each cluster.

The application of overlapping clustering revealed hidden connections that traditional methods might have missed, offering a detailed examination of how data points share similarities. In the pursuit of comprehensive data analysis, our application of overlapping clustering techniques has yielded profound insights into intricate relationships within diverse datasets. The implementation of scatter plots, box plots, and statistical summaries has allowed for a multi-faceted exploration of clustering patterns and characteristics.

Elbow Curves for K:

Scatter Plots:

Through meticulous examination of scatter plots, we observed the nuanced affiliations of data points, unravelling subtle connections that might elude traditional clustering methods. The visual representation not only provided clarity in cluster demarcation but also facilitated the identification of points with multiple affiliations, contributing to a holistic understanding of complex interdependencies.

Box Plots:

Box plots played a pivotal role in summarizing the distributional aspects of each cluster, offering a succinct yet informative portrayal of cluster characteristics. The detection of

outliers, those data points deviating significantly from the norm, was made efficient, providing a basis for anomaly identification and in-depth analysis.

Cluster Statistics:

CSV Outputs:

Notable findings emerged from the clustering analysis of the stock market dataset. "Vodafone Idea" and "MRF" stood outside clusters, reflecting unique financial

characteristics. Moreover, "Reliance" and "Titan Company" were observed to span multiple sectors, hinting at diversified operations. Additionally, risk-return profiles within the clusters were discerned, with Cluster 0 featuring expensive yet profitable stocks, Cluster 1 displaying higher dividends and ROE, and Cluster 2 emerging as the most expensive, leveraged, and profitable cluster. Shifting focus to the customer transaction dataset, the overlapping clustering technique provided valuable insights into credit score categorizations. This nuanced approach facilitated the grouping of customers into "good," "average," and "poor" credit score holders. Such detailed classifications offer a comprehensive understanding of the diverse financial behaviours and patterns within these distinct groups, contributing to more informed decision-making processes.

Overall, our results demonstrate the efficacy of overlapping clustering in uncovering intricate connections within datasets, providing a deeper and more nuanced understanding of the complex structures inherent in stock market dynamics and customer transaction behaviours.

Advantages over other Overlapping Clustering techniques like Fuzzy C-means:

This method introduces several advantages over existing overlapping clustering techniques. Firstly, it ensures clear and unambiguous assignments of data points to clusters simplifying interpretation and decision-making. The K-means algorithm employed in this approach is not only conceptually simpler but also computationally more efficient compared to certain fuzzy clustering methods. Its robustness to noise in the data sets it apart, as K-means demonstrates less sensitivity to outliers, providing more reliable cluster assignments. Additionally, the scalability of K-means proves beneficial for large datasets, ensuring efficient processing even with substantial data volumes. The method's

flexibility is enhanced by the introduction of a customizable radius threshold, allowing fine-tuning of overlapping regions based on specific dataset characteristics, providing control and adaptability.

In this study, we explored the realm of clustering, with a specific focus on the innovative concept of overlapping clustering. Our investigation covered two diverse datasets: Stock market, and Customer transaction details. Prior to applying the traditional K-means clustering as a baseline, we meticulously prepared the data and selected relevant attributes. The significant contribution of this work lies in introducing and applying overlapping clustering, allowing data points to have multiple affiliations. This novel approach enhanced our understanding of complex dataset structures, as demonstrated in the analysis of the stock market and customer transaction datasets. Insights gained from exploring each cluster emphasized the effectiveness of overlapping clustering in revealing intricate patterns and behaviours within groups of datapoints. This unique capability to uncover complex relationships, especially when data points belong to more than one cluster, sets overlapping clustering apart from traditional methods.

Author Contribution

D.R, S.M conceptualized and designed the study. I.S.S , S.H.G , V.H.M conducted the experiments and gathered the data. Data analysis and interpretation were carried out by I.S.S, S.H.G and V.H.M. S.H.G drafted the main manuscript text.All authors actively participated in the critical review and final approval of the manuscript. This collaborative effort ensures that each author's contribution is acknowledged, and collectively, we take accountability for the content of this publication. I.S.S - Imaad Salim ShaikhS.H.G – Sudhanva H GV.H.M – Vinay H M D.R – Dinesh RaoS.M – Sathyendranath Malli

S. Sun, K. Lei, Z. Xu, W. Jing, and G. Sun, "Analysis of K-means and K-DBSCAN Commonly Used in Data Mining," 2023 International Conference on Intelligent Media, Big Data and Knowledge Mining (IMBDKM), Changsha, China, 2023.
Karli Eka Setiawan, Afdhal Kurniawan, Andry Chowanda, Derwin Suhartono, "Clustering models for hospitals in Jakarta using fuzzy c-means and k-means," Procedia Computer Science, Bina Nusantara University, Jakarta, Volume 216, 2023.
X. Wu, Z. Chen, S. Yuan, J. Wei, and X. Wang, "An improved k-means algorithm based on density normalization," 2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 2021.
Sina Khan Mohammadi, Naiier Adibeig, and Samaneh Shanehbandy. 2017. "An improved overlapping k-means clustering method for medical applications." Expert Syst. Appl. 67, C January 2017.
"Seismic attribute selection and clustering to detect and classify surface waves in multicomponent seismic data by using k-means algorithm" by Ivan Sánchez Galvis, Yenni Villa, César Duarte, Daniel Sierra, and William Agudelo. The Leading Edge 2017.
H. Shen and Z. Duan, "Application Research of Clustering Algorithm Based on K- Means in Data Mining," 2020 International Conference on Computer Information and Big Data Applications (CIBDA), Guiyang, China, 2020.
D. Chi, "Research on the Application of K-Means Clustering Algorithm in Student Achievement," 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 2021.

No competing interests reported.

Download PDF

Version 1

posted

You are reading this latest preprint version

Implement overlapping clustering using K means where the points can belong to zero, one or more clusters

Status:

Version 1

Abstract

Figures

Introduction

LITERATURE REVIEW

Proposed Methodology

K-means Implementation: A Brief Overview

Enhancing K-means for Overlapping Clustering

Overlapping Clustering Implementation

RESULTS AND DISCUSSIONS

Advantages over other Overlapping Clustering techniques like Fuzzy C-means:

CONCLUSION

Declarations

Author Contribution

References

Additional Declarations

Status:

Version 1