Density Peaks Clustering Based on Natural Nearest Neighbor and Multi-cluster Mergers

Clustering by fast search and find of Density Peaks (DPC) has the advantages of being simple, efficient, and capable of detecting clusters of arbitrary shape. However, it still has some shortcomings: 1) the cutoff distance must be specified in advance, and the choice of local density formula affects the final clustering result; 2) after the cluster centers are found, the assignment strategy for the remaining points may produce a "Domino effect": once a point is misallocated, more points may be misallocated subsequently. To overcome these shortcomings, we propose a density peaks clustering algorithm based on the natural nearest neighbor and multi-cluster mergers. In this algorithm, a weighted local density calculation method is designed using the natural nearest neighbor, which avoids both the selection of the cutoff distance and the choice of local density formula. The algorithm uses a new two-stage assignment strategy to assign the remaining points to the most suitable clusters, thus reducing assignment errors. Experiments were carried out on artificial and real-world datasets. The experimental results show that the clustering effect of this algorithm is better than that of the other related algorithms.


Introduction
Clustering analysis is a significant technique in data mining (Frey et al. 2007; Han et al. 2011). It groups similar data objects into the same cluster by attributes or features in order to discover principles, knowledge, or rules that exist but have not yet been found (Rui et al. 2005; Guyon et al. 2012). It has been widely employed in many areas, such as image processing, business intelligence, and machine learning (Pappas 1992; Guo et al. 2009; Leong et al. 2017).
Because of the extensive application of clustering, scholars have researched it intensively and put forward many clustering algorithms. According to their characteristics, clustering algorithms include partition clustering (Kaufman 2008), hierarchical clustering (Dasgupta et al. 2005), density clustering (Kriegel et al. 2011), grid-based clustering (Deng et al. 2018), and model-based clustering (Fahad et al. 2014). The K-means algorithm (Jain 2009) is a well-known partition clustering algorithm. It is simple and efficient, but the number of clusters needs to be fixed in advance. Moreover, the random selection of the initial cluster centers easily leads the clustering result into a local optimum.
The DBSCAN algorithm (Ester et al. 1996) is a widely used density clustering algorithm that improves clustering speed, but its result is sensitive to the neighborhood radius and the density threshold. The OPTICS algorithm (Ankerst et al. 1999) partly solves the parameter-sensitivity problem of DBSCAN. Rodríguez et al. (2014) first proposed the DPC algorithm. The algorithm is efficient, can detect clusters of any shape, and does not need the number of clusters to be specified. At present, the DPC algorithm has huge potential application value in image segmentation, medical imaging, community discovery, etc.
However, the DPC algorithm also has a few shortcomings: 1) the manual selection of the cutoff distance and the subjective choice of local density formula have a relatively large impact on the selection of the initial cluster centers; 2) the assignment strategy for the remaining points easily produces consecutive assignment errors. One point's assignment error may lead to a series of subsequent assignment errors, which degrades the clustering result.
Due to the above shortcomings, many scholars have made improvements to the DPC algorithm.
Aiming at the selection of the cutoff distance and the definition of local density: the DPC-KNN-PCA algorithm (Du et al. 2016) replaced the selection of the cutoff distance with K-nearest neighbors. Yang et al. (2018) achieved an adaptive cutoff distance by proposing a cutoff-distance-adaptive algorithm based on information entropy. Liu et al. (2020) used the reverse natural nearest neighbor to optimize the local density formula, finding better initial cluster centers and obtaining better clustering results. Wu et al. (2020) redefined the method of selecting initial cluster centers by considering the weights of local density and relative distance. Fang et al. (2020) proposed the CFDPC algorithm, which establishes a connection between the cutoff distance and the data and calculates the local density and the clustering adaptively. Lv et al. (2021) proposed using geodesic distance and dynamic neighborhoods to measure the similarity between points, which clusters manifold datasets well. Tang et al. (2021) proposed the NNN-DPC algorithm, in which the local density is obtained through the natural nearest neighbor and the process is parameter-free.
For the consecutive assignment errors in the assignment step: Xie et al. (2016) proposed the FKNN-DPC algorithm, which divides the remaining points into core points, ordinary points, and outliers, gives them different orders and weights, and then assigns them. The DPCSA algorithm (Yu et al. 2019) introduces a weighted local density sequence and a two-stage assignment strategy and improves efficiency by designing a nearest-neighbor dynamic table; it breaks the strictly density-decreasing assignment order and reduces the propagation of assignment errors. Wu et al. (2019) proposed the DPC-SNR algorithm, which calculates the K-nearest neighbors and reverse K-nearest neighbors of each point to construct a symmetric neighborhood graph and uses this graph to assign the remaining points, thereby reducing assignment errors. Chen et al. (2020) defined a similarity between clusters to merge clusters, thereby reducing the possibility of assignment errors.
To address these shortcomings, we propose a density peaks clustering algorithm based on the natural nearest neighbor and multi-cluster mergers. In this algorithm, the local density is redefined by applying the ideas of the natural nearest neighbor and weighting and by considering the relationship between the whole and the part. Furthermore, the probability of misassignment in the DPC algorithm is reduced by a two-stage assignment strategy. The experimental results indicate that the algorithm is effective and robust.
The structure of this paper is as follows. In Sect. 2, we introduce the DPC algorithm and its shortcomings. In Sect. 3, the DPC++ algorithm is proposed. In Sect. 4, we conduct an experimental comparison of the DPC++ algorithm with other related algorithms. Finally, we conclude in Sect. 5.

DPC Algorithm
In this section, we introduce the DPC algorithm and its shortcomings.

Algorithmic principle
The basic assumptions of the DPC algorithm are: 1) each cluster center is surrounded by neighbors with lower local density; 2) there is a relatively large distance between cluster centers. For any point x_i in the dataset D, the DPC algorithm first calculates the local density and relative distance of x_i; then it selects the density peaks (i.e., cluster centers) from the decision graph formed by the local density and relative distance of each point; finally, it completes the clustering according to the assignment strategy.
Local density ρ_i indicates the density around the point x_i. There are two kinds of calculation equations, the Cut-off kernel, Eq. (1), and the Gaussian kernel, Eq. (2):

ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,   (1)

ρ_i = Σ_{j≠i} exp(−(d_ij / d_c)²),   (2)

where d_ij is the distance between points x_i and x_j, and d_c is the cutoff distance.

Relative distance δ_i is the minimum distance from point x_i to any point with higher local density; for the point with the highest local density, it is the maximum distance to any other point:

δ_i = min_{j: ρ_j > ρ_i} d_ij, and δ_i = max_j d_ij for the global density peak.   (3)
After obtaining ρ_i and δ_i for all points, the DPC algorithm takes the ρ_i and δ_i of each point x_i as the horizontal and vertical coordinates of a two-dimensional decision graph and selects the points with both high ρ and high δ as the cluster centers. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor with higher density.
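As an illustration, the procedure described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's code: it uses the Gaussian kernel, the common ρ·δ ranking for picking centers, and assumes the number of centers is supplied by hand.

```python
import numpy as np

def dpc(X, dc, n_centers):
    """Minimal sketch of the standard DPC procedure with a Gaussian kernel."""
    n = len(X)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local density (Gaussian kernel): rho_i = sum_{j != i} exp(-(d_ij/dc)^2).
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0   # subtract the j = i term
    # Relative distance: distance to the nearest point of higher density;
    # the global density peak gets the maximum distance instead.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    # Cluster centers: the points with the largest gamma = rho * delta.
    centers = np.argsort(rho * delta)[-n_centers:]
    labels = -np.ones(n, dtype=int)
    labels[centers] = np.arange(n_centers)
    # Assignment: in decreasing-density order, every remaining point joins
    # the cluster of its nearest neighbor with higher density.
    for i in np.argsort(-rho):
        if labels[i] != -1:
            continue
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:                      # safeguard for density ties
            labels[i] = labels[centers[np.argmin(d[i, centers])]]
        else:
            labels[i] = labels[higher[np.argmin(d[i, higher])]]
    return labels
```

Note how the single-pass assignment is exactly where the "Domino effect" arises: one wrong `labels[i]` is inherited by every lower-density point that follows it.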

Existing problems
The DPC algorithm has some shortcomings.
1) Choosing different cutoff distances and density formulas yields quite different clustering results.
In the DPC algorithm, the cutoff distance is chosen somewhat arbitrarily within a range, which affects the result of the density formula. At the same time, the DPC algorithm offers two formulas for calculating the local density.
For larger datasets, the Cut-off kernel clusters better, and for smaller datasets, the Gaussian kernel clusters better. However, there is no objective criterion for measuring the size of a dataset.
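To make the contrast concrete, the two kernels can be written side by side (a sketch assuming `d` is the pairwise distance matrix, following the standard DPC forms of Eq. (1) and Eq. (2)). On a small dataset, the Cut-off kernel returns tied integer counts, while the Gaussian kernel yields distinct real values, which is why kernel choice matters:

```python
import numpy as np

def cutoff_density(d, dc):
    """Cut-off kernel, Eq. (1): number of points closer than the cutoff dc."""
    return (d < dc).sum(axis=1) - 1          # subtract the point itself

def gaussian_density(d, dc):
    """Gaussian kernel, Eq. (2): smooth sum of exp(-(d_ij/dc)^2)."""
    return np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
```

With four points at 0, 0.5, 1.0, and 5.0 on a line and dc = 1.1, the Cut-off kernel gives the three close points identical density, while the Gaussian kernel correctly singles out the middle point as densest.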
In Fig.1(a) and (b), the clustering effect of d_c = 3 is better than that of d_c = 2 when both use the Gaussian kernel. In Fig.1(b) and (c), the clustering effect of the Gaussian kernel is better than that of the Cut-off kernel when using the same d_c.
2) The assignment strategy of the DPC algorithm easily produces consecutive assignment errors.
After the density peaks are found, the assignment strategy of the DPC algorithm easily produces a "Domino effect": one point's assignment error may lead to a series of subsequent errors, as shown in Fig.2.

DPC++ Algorithm
In this section, we propose the DPC++ algorithm.
To solve the above problems, we propose a density peaks clustering algorithm based on the natural nearest neighbor and multi-cluster mergers (DPC++). This algorithm improves the DPC algorithm in two respects: 1) the local density is defined uniformly; 2) the assignment strategy for the remaining points after the cluster centers are found is changed. The main idea of the algorithm is to calculate the local density of each point based on natural nearest neighbors. After finding the cluster centers and the potential cluster centers, the remaining points are assigned to the correct clusters using shared nearest neighbors and finite iterations (Zhu et al. 2014; Zhu et al. 2016; Levent et al. 2003).

Natural Nearest Neighbor
In the study of data structures, several concepts of nearest neighbors have been proposed; the natural nearest neighbor is among the most commonly used. It can be seen from its definition that the natural nearest neighbor is an asymmetric neighbor relationship, in which the value of the parameter r is only an intermediate variable of the natural nearest neighbor search algorithm. The termination condition of the search is: during two consecutive iterations, the number of outlier points (that is, points without natural neighbors) no longer changes.
The natural nearest neighbor search process starts with r = 1 and obtains the r nearest neighbors of each point in each round. The natural nearest neighbor has great advantages: 1) compared with the k-neighborhood and the ε-neighborhood, it can be generated without any parameters; 2) it represents the neighbor relationships between points more accurately. Compared with the k-neighborhood, it avoids forcing distant points to become neighbors of outliers that have no real neighbors or fewer than k neighbors, thus better reflecting the true neighbor relationships. Compared with the ε-neighborhood, it avoids the inelastic neighbor relationships caused by fixing ε, again better reflecting the true neighbor relationships.
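The search loop described above can be sketched as follows. This is a hedged reconstruction of the usual natural-neighbor search, not the paper's exact pseudocode: grow r from 1, count the points that no other point has yet taken as a neighbor, and stop when that count stops changing between consecutive rounds.

```python
import numpy as np

def natural_neighbor_search(X):
    """Sketch of the natural nearest neighbor search: grow r from 1 until the
    number of points with no reverse neighbors stops changing between rounds."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(d, axis=1)        # order[i, 0] is i itself
    reverse_count = np.zeros(n, dtype=int)
    r, prev_outliers = 0, -1
    while r < n - 1:
        r += 1
        # Add each point's r-th nearest neighbor; count reverse neighbors.
        for i in range(n):
            reverse_count[order[i, r]] += 1
        outliers = int((reverse_count == 0).sum())
        if outliers == prev_outliers:    # unchanged over consecutive rounds
            break
        prev_outliers = outliers
    # Natural neighbors here: mutual membership in each other's r-NN lists.
    knn = [set(order[i, 1:r + 1]) for i in range(n)]
    nan = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return r, nan
```

The returned r is the "natural characteristic value" that later stages of the algorithm reuse, so no parameter ever has to be supplied by the user.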

Local density
We aim to solve the problem of inconsistent measurement rules for the local density in the DPC algorithm.
We redefine the local density based on the ideas of weighting and the natural nearest neighbor.
Moreover, the definition of local density considers both local and global information.
First, the idea of the natural nearest neighbor is incorporated into the design of the local density, because it better expresses the proximity relationships between points and thus evaluates the local density of each point more accurately and naturally. Compared with the local density calculated by the cutoff distance or the K-nearest neighbor method, this approach lets the local density of each point be calculated adaptively without manual parameters, so it is very robust.

Definition 3 (dependent neighbor) BNN(i) is the set of dependent neighbors of point x_i: the points that belong to the r nearest neighbors of x_i but do not belong to the natural neighbors of x_i. It is expressed as Eq. (5):

BNN(i) = KNN_r(x_i) \ NNN(x_i).   (5)
Definition 4 (local density ρ_i) A new definition of local density based on natural nearest neighbors, expressed as Eq. (6).
Second, the density formula in this paper also embodies the idea of weighting, giving different weights to neighbors according to their importance. This paper classifies the neighbors of each point into two classes, the natural nearest neighbors and the dependent neighbors, and gives them different weights.
Finally, the third part of the formula calculates the density contribution of the points outside the nearest neighbors. This definition of local density takes into account both local and global information. It avoids the situation in which the selection of the cluster centers focuses only on a local area, making a cluster center deviate from the central region and leading to incorrect clustering results.
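Eq. (6) itself is not reproduced here; the following sketch only illustrates the three-part weighted structure the text describes, with hypothetical weights `w_nat`, `w_dep`, and `w_glob` (these values are illustrative assumptions, not the paper's):

```python
import numpy as np

def weighted_local_density(d, knn, nan, w_nat=1.0, w_dep=0.5, w_glob=0.1):
    """Illustrative sketch only (not the paper's Eq. (6)): combine three
    weighted parts in the spirit of the text -- natural neighbors (highest
    weight), dependent neighbors (lower weight), and a small global
    contribution from all remaining points."""
    n = d.shape[0]
    rho = np.zeros(n)
    for i in range(n):
        dep = knn[i] - nan[i]                    # dependent neighbors, Def. 3
        rest = set(range(n)) - knn[i] - {i}      # all other points (global part)
        rho[i] = (w_nat * sum(np.exp(-d[i, j]) for j in nan[i])
                  + w_dep * sum(np.exp(-d[i, j]) for j in dep)
                  + w_glob * sum(np.exp(-d[i, j]) for j in rest))
    return rho
```

The point of the structure is that a dense region scores high through its natural neighbors, while the small global term keeps the estimate from being dominated by a purely local pocket of points.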
In Fig.3(b) and (c), the red circles mark the cluster centers, the yellow circles mark the potential cluster centers, and the black circles mark outliers. It is obvious from the red circles that the cluster centers of the DPC++ algorithm are easier to find. The potential cluster centers are obvious in the graph of the DPC++ algorithm but are not easy to find with the DPC algorithm; potential cluster centers that are easy to find are very helpful for the assignment strategy of this algorithm. As for the black circles, the outliers are obvious in the DPC++ algorithm, whereas in the DPC algorithm the candidates are too dense to pick out the true outliers.

Two-stage assignment Strategy
Aiming at the consecutive assignment errors of the DPC algorithm, a new two-stage assignment strategy is proposed, which largely avoids this problem.
Phase I: Selecting potential cluster centers and forming multiple clusters.
After the cluster centers cc are selected, there are still some points in the decision graph that have higher local density and relative distance than most ordinary points. These points are far from other peaks and are surrounded by points with low local density; they are defined as the potential cluster centers pcc. The condition for any point x_i is expressed as Eq. (7).
w is a density weight, w ∈ [0.2, 2], generally w = 0.5; w is highly robust when the scale of the dataset is large.
Taking pcc and cc as density peaks, each remaining point is assigned to the same cluster as its nearest neighbor with higher density, so that clusters with pcc and cc as peaks are formed.
Phase II: Merging the clusters with pcc as peaks into the clusters with cc as peaks.
Definition 6: co_i is a set of points consisting of cc and the points in pcc whose local density is larger than that of pcc_i, where pcc_i is any point in pcc. It is expressed as Eq. (9).
The second stage is a brief process: let k = r, where r is the final search radius obtained by the natural nearest neighbor search.
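Since Eqs. (7)-(9) are not reproduced here, the following is only a hedged sketch of Phase II under one plausible reading: each cluster headed by a potential center is merged into the true-center cluster with which its members share the most k-nearest-neighbor links (k = r from the natural nearest neighbor search). The function name and voting rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def merge_pcc_clusters(d, labels, cc_labels, k):
    """Hedged sketch of Phase II: merge each pcc-headed cluster into the
    true-center cluster to which its members have the most k-NN links."""
    n = d.shape[0]
    order = np.argsort(d, axis=1)[:, 1:k + 1]   # k nearest neighbors per point
    labels = labels.copy()
    for lab in set(labels) - set(cc_labels):    # clusters headed by a pcc
        members = np.where(labels == lab)[0]
        votes = {c: 0 for c in cc_labels}
        for i in members:
            for j in order[i]:
                if labels[j] in votes:          # neighbor already in a cc cluster
                    votes[labels[j]] += 1
        target = max(votes, key=votes.get)      # most shared-neighbor links
        labels[members] = target
    return labels
```

Because whole clusters are merged at once instead of points being relabeled one by one, a single early mistake can no longer cascade down a density-ordered chain of points.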

Time complexity analysis
The time complexity of the DPC algorithm comes from the calculation of the distance matrix between points, the local density, the relative distance, and the assignment strategy, which are O(n²), O(n²), O(n²), and O(n), respectively, so the overall complexity is O(n²).

Experimental results and analysis
In this section, we conduct an experimental comparison of the DPC++ algorithm with other related algorithms.

Datasets and Data pre-processing
For the purpose of verifying the performance of the DPC++ algorithm, artificial datasets and real-world datasets are used for testing and evaluation. These datasets are described in detail in Tables 1 and 2.

Parameter optimization of each algorithm
We introduce the source and parameter optimization of each algorithm.
In this paper, the DPC++ algorithm is compared with the DPC, DPCSA, CFDPC, and DBSCAN algorithms. To guarantee the best clustering result of each algorithm, it is necessary to optimize its parameters. The DPC++ algorithm needs the density weight w to select the potential cluster centers; the range of w is [0.2, 2], and w is robust when the dataset is large. The DPC algorithm needs to select the cutoff distance d_c and choose between the two density formulas. The DPCSA and CFDPC algorithms do not require parameters. The parameters required by the DBSCAN algorithm are the neighborhood radius and the density threshold, and changes in these parameters have a great influence on the result. After many experiments, the best parameters were selected for each algorithm.

Clustering Evaluation
To compare different algorithms, a suitable and uniform evaluation index is needed. In this paper, three evaluation indexes are used: clustering accuracy (ACC), the adjusted Rand index (ARI), and adjusted mutual information (AMI).
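As a concrete reference, ARI and a brute-force ACC can be computed as follows; this is an independent sketch, not the paper's evaluation code. In practice all three indexes are available in scikit-learn (`adjusted_rand_score`, `adjusted_mutual_info_score`, and ACC via Hungarian matching with `scipy.optimize.linear_sum_assignment`); AMI is omitted here because its expected-value correction is lengthy.

```python
import numpy as np
from itertools import permutations

def comb2(x):
    """Number of unordered pairs, C(x, 2), applied elementwise."""
    return x * (x - 1) / 2.0

def ari(true, pred):
    """Adjusted Rand Index from the contingency table (standard formula)."""
    true, pred = np.asarray(true), np.asarray(pred)
    classes, clusters = np.unique(true), np.unique(pred)
    table = np.array([[np.sum((true == c) & (pred == k)) for k in clusters]
                      for c in classes])
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(true))
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

def acc(true, pred):
    """Clustering accuracy: best one-to-one relabeling of clusters to classes
    (brute force over permutations; assumes #clusters <= #classes)."""
    true, pred = np.asarray(true), np.asarray(pred)
    clusters = np.unique(pred)
    best = 0.0
    for perm in permutations(np.unique(true), len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, np.mean([mapping[p] == t for p, t in zip(pred, true)]))
    return best
```

Both indexes are invariant to permuting cluster labels, which is exactly the property needed when comparing unsupervised results against ground truth.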

Result analysis of artificial datasets
From Table 3, we can see that the clustering result of the DPC++ algorithm on the Aggregation dataset is almost equal to that of the best-performing DPC algorithm and better than the other algorithms. On the Flame dataset, the clustering effect of the DPC++ algorithm is only slightly lower than that of the best algorithm. On the D15 dataset, the clustering performance of the DPC++ algorithm is slightly lower than that of the DPC and CFDPC algorithms; however, the difference is very small, and DPC++ is better than the other algorithms. On the remaining datasets (D31, Spiral, Spiral_unbalance, Moon, and Circles), the DPC++ algorithm performs best.

Fig.4 The effect of the five algorithms on the Aggregation dataset

In Fig.4, we can observe that the Aggregation dataset is composed of 7 heap-like clusters whose characteristics are obvious, although there are cross-links between clusters. The clustering performance of the DPC++ algorithm is slightly weaker than that of the DPC algorithm, ranking second. The DPC++ algorithm finds the cluster centers exactly; the main assignment errors occur at the points on the right side where two cluster boundaries meet. The labels of the cluster boundaries may be disputed because the dataset is artificial, so the performance of the DPC++ algorithm is good.

In Fig.6, we can observe that the D15 dataset has fifteen clusters. The outermost seven clusters are far away from each other, and all of the clustering algorithms cluster them correctly; the innermost eight clusters are adjacent to and intertwined with each other, so their samples are easily misassigned. Except for the DBSCAN algorithm, the clustering accuracy of the other algorithms on the D15 dataset is above 99%.

In Fig.7, we can observe that the D31 dataset contains thirty-one clusters. There are many clusters, and they are connected. All of these clustering algorithms perform well, but the DPC++ algorithm is the best.
Fig.9 The effect of the five algorithms on the Spiral_unbalance dataset

In Fig.9, we can observe that the Spiral_unbalance dataset is a manifold dataset consisting of two clusters whose spirals are unbalanced. The DPC++, DPC, and DBSCAN algorithms cluster it completely correctly, while the DPCSA and CFDPC algorithms have poor clustering effects.

In Fig.11, we can observe that the Circles dataset is a manifold dataset consisting of two circle-like clusters, one big and one small. The DPC++, DPCSA, and DBSCAN algorithms cluster it successfully, while the DPC and CFDPC algorithms have larger errors.

Analysis of experimental results of real-world datasets
To verify the clustering performance of the DPC++ algorithm on real-world datasets, we compare it with the four other clustering algorithms on eight real-world datasets. The clustering results of each algorithm are shown in Table 4. The experimental results show that the ACC and ARI indexes of the DPC++ algorithm are slightly lower than those of the DPCSA algorithm on the iris dataset. On the parkinsons dataset, the AMI index of the DPC++ algorithm is slightly lower than that of the DPC algorithm. On the remaining datasets (dermatology, wdbc, libras, wine, seeds, and segmentation), the clustering effect of the DPC++ algorithm is better than that of the compared algorithms.

Fig.13 The mean of the three evaluation indexes on real-world datasets

In Fig.13, the mean of all indexes of the DPC++ algorithm ranks first among all the clustering algorithms, showing that the DPC++ algorithm is the best overall.

Conclusion
In the DPC algorithm, the selection of the cutoff distance is difficult, and the definition of local density is not uniform; in addition, it is easy to make consecutive assignment errors. To address these shortcomings, we propose a density peaks clustering algorithm based on the natural nearest neighbor and multi-cluster mergers. The DPC++ algorithm combines the ideas of weighting and the natural nearest neighbor, redefines the local density, measures the local density of each point more accurately, and solves the problems of cutoff-distance selection and the non-uniform definition of local density. At the same time, the DPC++ algorithm combines shared nearest neighbors and a fast, finite iteration to merge multiple clusters, which reduces the possibility of consecutive assignment errors. Experimental results on various datasets show that the DPC++ algorithm finds density peaks and processes datasets of different shapes more precisely, and its clustering performance is very good.
The DPC++ algorithm does not involve the extraction and processing of noise, so the clustering of boundary points in some datasets may be wrong. The next step is to further identify and process noise based on this method to improve the accuracy of clustering, which is the focus of the next stage of research.