Assume that the cluster centers are the data points with maximum local density, and then calculate the local density ρ_i of each point in the dataset:

ρ_i = \sum_{j \neq i} \exp\!\left[-\left(d_{ij}/d_c\right)^{2}\right]
where d_ij is the Euclidean distance between points i and j, and d_c is the cutoff distance, chosen so that the average number of neighbors within d_c is roughly 1% to 2% of the total number of data points. Because a Gaussian kernel is used to compute the local density ρ [7,13], the probability that two points have exactly the same local density is low. Assuming, therefore, that the local density ρ differs from point to point, the data points are sorted by their local density values. Fig.1 shows the schematic diagram of BCALoD.
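As a concrete sketch of this step, the Gaussian-kernel density and a quantile-based choice of d_c could be computed as follows (function names and the quantile heuristic are my own; the paper only states the 1%-2% proportion):

```python
import numpy as np

def local_density(X, dc):
    """Gaussian-kernel local density rho_i = sum_{j != i} exp(-(d_ij/d_c)^2).

    X  : (n, d) array of data points
    dc : cutoff distance
    """
    # Pairwise Euclidean distances d_ij
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    # Subtract 1.0 to drop the j == i term, which contributes exp(0) = 1
    return np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0

def cutoff_distance(X, frac=0.02):
    """Pick d_c as the `frac` quantile of all pairwise distances,
    one plausible reading of the 1%-2% rule."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    upper = d[np.triu_indices(len(X), k=1)]   # each pair counted once
    return np.quantile(upper, frac)
```

A point surrounded by close neighbors then receives a larger ρ than an isolated one, which is what the sorting step relies on.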
In the up process, point k is judged to be a cluster center if no point within the cutoff distance has a larger local density than point k. Each point in the dataset establishes a connection only with its upper layer and has no relation to the layer below it; that is, the tops of the resulting data chains are independent of the starting point. The up process starts from the point with the smallest local density ρ. The addressing sequence reflects only the path from bottom to top and does not change the local density of the cluster centers. A small cluster whose points have no higher-density neighbor within the cutoff distance forms a category of its own. Overall, the up process can be regarded as the inverse operation of assigning data points in a density peak clustering algorithm.
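The chain-building rule above can be sketched as follows; this is one reading of the paper, with names and structure of my own (each point links to the nearest strictly denser point within d_c, and a point with no such neighbor becomes a chain top, i.e. a cluster center):

```python
import numpy as np

def up_process(X, rho, dc):
    """Bottom-up chain building: returns (parent, labels).

    parent[i] is the upper-layer point of i, or -1 if i is a
    cluster centre; labels[i] is the index of i's chain top."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    parent = np.full(n, -1)
    # Start from the point with the smallest local density, as in the paper.
    for i in np.argsort(rho):
        higher = np.where(rho > rho[i])[0]      # strictly denser points
        cand = higher[d[i, higher] <= dc]       # ... within the cutoff
        if cand.size:
            parent[i] = cand[np.argmin(d[i, cand])]
        # else: point i is the top of a chain, i.e. a cluster centre

    # Follow each chain to its top to assign cluster labels.
    def top(i):
        while parent[i] != -1:
            i = parent[i]
        return i

    centres = {}
    labels = np.array([centres.setdefault(top(i), len(centres))
                       for i in range(n)])
    return parent, labels
```

Because every point connects only upward, the label a point receives depends only on where its chain terminates, not on the order in which points are visited.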
In clustering by fast search and find of density peaks, when large clusters contain far more data than small clusters, the information of the small clusters is easily overwhelmed, which introduces subjectivity into determining the number of clusters. Bidirectional clustering can properly identify small clusters. The mean shift algorithm must seed random data points and continuously iterate a sliding window, which incurs a high computational cost and usually hampers the analysis of high-dimensional clusters. The proposed method of constructing data clusters by data chains reduces the operational cost and ensures that the selection of cluster centers is independent of the initial selection. BCALoD establishes the data chains without embedding the data into a vector space, automatically determines the number of clusters, is more sensitive to small clusters than other methods, and minimizes the number of parameters to tune. In summary, this method combines the advantages of clustering by fast search and find of density peaks and of mean shift clustering.
We used a Gaussian mixture distribution to generate a 2D dataset (Case 1), as shown in Fig.2a. Fig.2b shows the results of the BCALoD clustering algorithm. Table 1 shows the size of each cluster.
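A dataset of this kind could be generated along the following lines; the component means and standard deviation here are placeholder values (the paper does not give them), and only the component sizes are taken from Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction of Case 1: six Gaussian components whose
# sizes match Table 1; means and spread are assumptions for illustration.
sizes = [50, 400, 800, 1000, 2000, 5000]
means = [(0, 8), (8, 0), (0, 0), (8, 8), (4, 4), (-6, -6)]
X = np.vstack([rng.normal(m, 0.6, size=(n, 2))
               for n, m in zip(sizes, means)])
print(X.shape)  # (9250, 2)
```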
Table 1. Cluster sizes of the 2D Gaussian mixture distribution dataset

No.            | 1-7 | 8 | 9  | 10  | 11  | 12   | 13   | 14
Set size       | 0   | 0 | 50 | 400 | 800 | 1000 | 2000 | 5000
Cluster result | 1   | 3 | 50 | 394 | 778 | 996  | 2019 | 5003
The resulting clusters 1 through 7 contained one data point each, while cluster 8 contained three data points. These points lie far from the rest of the data and are referred to as discrete points; in the BCALoD cluster analysis, such points are grouped into categories of their own. The size of cluster 9 is 50, which is much smaller than the amount of data contained in the other clusters; this type of cluster is referred to as a small cluster. The results demonstrate the excellent clustering performance of the BCALoD algorithm on small clusters. The accuracy rate is over 98.5%.