The K-means algorithm first randomly selects K objects as cluster centers, then assigns each sample point to the cluster whose centroid is closest under the Euclidean distance, and finally recomputes each centroid as the mean of the sample points in its cluster, repeating until the results converge.
The specific steps of the algorithm are as follows:

Randomly generate K centroids;

Calculate the distance between every sample point and each centroid, and assign each point to the cluster of its nearest centroid. The distance between a point and a centroid is the Euclidean distance, given by formula (1);
$$d\left({x}_{t},{z}_{k}\right)=\sqrt{{\sum }_{i=1}^{N}{\left({x}_{ti}-{z}_{ki}\right)}^{2}}\tag{1}$$
Here xt is a sample point from the sample set {x1, x2, ..., xt}, zk is one of the K centroids {z1, z2, ..., zk}, and N is the number of features of each sample; each sample point is assigned to the class of its closest centroid.
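Formula (1) can be sketched in a few lines of Python; the function name below is chosen for illustration:

```python
import math

def euclidean_distance(x, z):
    """Euclidean distance between a sample point x and a centroid z,
    both given as equal-length coordinate sequences (formula (1))."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

# Example: distance between a 2-D sample point and a centroid
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```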

Recompute the K centroids as the mean of all points assigned to each cluster;

Repeat steps 2 and 3 until the sum of the distances between all sample points and their corresponding centroids is minimized; the results converge after multiple iterations.
The flow chart of the K-means algorithm is shown in Fig. 1.
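The steps above can be sketched in pure Python as follows; this is a minimal illustration (function and variable names are chosen here, and the data is invented for the example), not a production implementation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial centroids, Euclidean
    assignment, mean-based centroid update, stop on convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: K random centroids
    for _ in range(max_iter):
        # step 2: assign each point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((pi - ci) ** 2
                                      for pi, ci in zip(p, centroids[c])))
            clusters[j].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 4: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Toy example: two well-separated groups of 2-D points
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, clusters = kmeans(pts, k=2)
```

As the table later in this section notes, the random initialization means the result may be a local optimum; practical implementations typically rerun with several seeds.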
DBSCAN assumes that the clustering result is determined by the tightness of the sample distribution: data clustered into the same category are closely connected, that is, around any sample of a category there must be other samples of the same category. The final clustering result is obtained by dividing all closely connected samples in the set into categories and displaying them as scatter plots, with different categories shown in different colors for a more intuitive visual presentation.
According to [11], for the sample set D = (x1, x2, ..., xm), the DBSCAN algorithm rests on five core definitions: the Eps neighborhood, core and border objects, direct density reachability, density reachability, and density connectivity. A cluster can contain one or more core points. If there is only one core point, all other non-core samples of the cluster lie in the Eps neighborhood of that core point. If there are multiple core points, the Eps neighborhood of any core point in the cluster must contain at least one other core point; otherwise the two core points would not be density reachable. The union of all samples in the Eps neighborhoods of these core points forms a DBSCAN cluster.
The quality of DBSCAN clustering depends on the parameters Eps and MinPts, so multiple experiments are needed to find a well-performing pair of values. After many experiments, Eps = 1 and MinPts = 1 were selected. K-means and DBSCAN each have advantages and disadvantages during implementation; the comparison is given in Table 1.
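The density-based expansion described above can be sketched in pure Python; this is a minimal illustration under the definitions in [11] (function names and the toy data are chosen here), not the implementation used in the experiments. Note that with MinPts = 1 every point counts as a core point, so no sample is labeled as noise:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: labels[i] is the cluster id of points[i],
    or -1 for noise. eps is the neighborhood radius; min_pts is the
    minimum neighborhood size (the point itself included) of a core point."""
    def region(i):
        # Eps neighborhood of points[i], by Euclidean distance
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1                  # noise (may become a border point)
            continue
        cluster += 1                        # i is a core point: new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:                        # expand density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(region(j)) >= min_pts:   # j is itself a core point
                queue.extend(region(j))
    return labels

# With Eps = 1 and MinPts = 1 (the values selected above),
# every point is a core point and no sample is marked as noise.
print(dbscan([(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)], eps=1.0, min_pts=1))
```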
Table 1
Analysis of advantages and disadvantages of clustering algorithms
Algorithm  Advantages  Disadvantages
DBSCAN  (1) No need to specify the number of clusters in advance; (2) outliers and clusters of arbitrary shape can be found.  (1) When the sample data is large, clustering convergence is slow; (2) when the sample densities differ greatly, the clustering effect is poor; (3) the parameters are complex to tune.
K-means  (1) The clustering principle is simple and there are few parameters, so clustering is fast and the process is simple; (2) the clustering effect is good and interpretability is strong.  (1) The value of K must be specified in advance, and its choice is hard to grasp; (2) generally only applicable to convex datasets; (3) the initial centroids are chosen at random, so the result may be a local optimum.
As the table shows, the K-means algorithm has simpler parameters than DBSCAN, is easy to implement, and does not take much time. DBSCAN, by contrast, has more parameters, which strongly affect the clustering result, but it does not require specifying the number of clusters K in advance, and its cluster centers are actual samples from the data, whereas the centroids of K-means are randomly assigned or computed as means and need not correspond to real samples.