A novel spectral clustering algorithm based on neighbor relation and Gaussian kernel function with only one parameter

Spectral clustering has become a prevalent option for data clustering because it performs well on non-convex datasets with sophisticated structures. Its effectiveness depends on the construction of the similarity graph matrix. In this paper, to further enhance clustering performance, we propose a novel similarity measure function based on neighbor relations; the resulting method is called SC-NR. It uses the Gaussian kernel function to measure the similarity between two objects. Since the Euclidean distance cannot fully reflect the relation between data points, the method adds a weight, related to the order of nearest neighbors, to the distance between two points, so that similarity is better expressed by a weighted Euclidean distance. In experiments, we compare the proposed method with previous work via three external indexes: clustering accuracy (ACC), normalized mutual information (NMI), and F-measure. The comparison with state-of-the-art methods on 6 synthetic datasets and 12 real-world datasets demonstrates the superiority of our algorithm; for instance, on the PenDigits dataset the F-measure is 16.50% higher than that of existing algorithms.

Spectral clustering can be considered as a graph partitioning task. It relaxes the discrete clustering problem into a continuous one and transforms the search for the optimal partition into the spectral decomposition of the Laplacian matrix of the graph (Hagen & Kahng, 1992; Shi & Malik, 2000). Since this approach performs eigendecomposition directly on the Laplacian matrix and then applies a traditional clustering algorithm to the eigenvectors, it can cluster non-convex data well. Furthermore, spectral clustering depends only on the similarity function and is unaffected by the dimension of the data.
The similarity matrix is the core of the spectral clustering algorithm. A qualified similarity measure should ensure that data in the same group have a high degree of similarity and follow spatial consistency. The Gaussian kernel function with Euclidean distance is one common method for constructing similarity matrices. However, because the scale parameter is set manually, the resulting clusters are often inaccurate.
Ng et al. (Ng, Jordan, & Weiss, 2001) presented a global method for the automatic selection of the scale parameter: the most appropriate value is determined by repeatedly running the spectral clustering algorithm, a procedure that requires manual adjustment over several experiments to reach the optimal solution. However, the obtained $\sigma$ is a global value and may not apply well to all data. In 2004, Zelnik-Manor et al. (Zelnik-Manor & Perona, 2004) refined the scale parameter setting by choosing $\sigma$ separately for each sample; the value of $\sigma$ is adjusted automatically according to the local nearest neighbors of two points $x_i$ and $x_j$, avoiding the impact of a single global scale parameter. Zhang et al. (Zhang, Li, & Yu, 2011) proposed the common nearest neighbor method, which uses the local density information in the nearest neighbor domain to construct the similarity matrix. If two points belong to the same class, they should have similar densities and lie in the same region. Capitalizing on this fact, the common nearest neighbor approach uses the local density information between two data points to adjust the Gaussian kernel function once a fixed radius is given. In 2012, Li et al. (X.-Y. Li & Guo, 2012) introduced an approach that constructs the similarity matrix by neighbor relation propagation. If the Euclidean distance between points $x_i$ and $x_j$ is tiny, then $x_i$ and $x_j$ are neighbors of each other. Suppose the distance between $x_j$ and $x_k$ is also quite small, while the distance between $x_i$ and $x_k$ is slightly larger; according to the concept of neighbor relation propagation, $x_i$ and $x_k$ are eventually considered neighbors as well. This approach exploits the principle of neighbor relation propagation to discover the intrinsic structure of the sample. Cao et al. (Cao, Chen, & Wang, 2022) employed shared neighbors to construct the similarity matrix. Shared neighbors are points that are k-nearest neighbors of both point $x_i$ and point $x_j$; in most cases, a pair of data points has higher similarity when they share more nearest neighbors.
In this paper, we propose a novel spectral clustering algorithm, called SC-NR, whose similarity measure is based on neighbor relations and a Gaussian kernel function with only one parameter. In SC-NR, we use the average distance of each data point to its k-nearest neighbors to reflect the local density between data points, so the whole algorithm has only one parameter, k.
The remainder of this paper is organized as follows. Section 2 gives an overview of the spectral clustering algorithm. Details of the proposed SC-NR algorithm are described in Section 3. Experimental results are presented in Section 4, and the conclusion is given in Section 5.

Overview of spectral clustering
In this section, we briefly cover the basics of spectral clustering and introduce one of the most classical spectral clustering algorithms.
Given a set of m data points $X = \{x_1, x_2, \ldots, x_m\} \subset \mathbb{R}^n$, the goal of clustering is to divide these points into p categories such that points within the same category are as similar as possible and points in different categories are as different as possible. We can represent the relations among these m data points by an undirected weighted graph $G = (V, E)$, where vertex $v_i$ represents data point $x_i$. For any two vertices $v_i$ and $v_j$ in V, we define edge weights $w_{ij} = w_{ji}$, with $w_{ij} > 0$ if there is an edge connecting them and $w_{ij} = 0$ otherwise. The Gaussian kernel function is one of the most common functions used to construct the similarity matrix:

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{dis(x_i, x_j)^2}{2\sigma^2}\right), & x_i \neq x_j, \\ 0, & x_i = x_j, \end{cases} \tag{1}$$

where $dis(x_i, x_j) = \|x_i - x_j\|_2$ is the Euclidean distance between data points $x_i$ and $x_j$, and $\sigma$ is the kernel parameter (Shi & Malik, 2000).
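For illustration, a minimal NumPy sketch of Eq. (1) follows; the helper name `gaussian_similarity` and the use of a single global $\sigma$ are our illustrative assumptions rather than part of any reference implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_similarity(X, sigma):
    """Similarity matrix of Eq. (1) with a single global kernel parameter sigma."""
    D = squareform(pdist(X))              # pairwise Euclidean distances, shape (m, m)
    W = np.exp(-D**2 / (2.0 * sigma**2))  # w_ij = exp(-dis(x_i, x_j)^2 / (2 sigma^2))
    np.fill_diagonal(W, 0.0)              # Eq. (1): w_ij = 0 when x_i = x_j
    return W
```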
After obtaining the similarity matrix W, it is used to construct the normalized Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$, where D is the diagonal degree matrix with entries $d_i = \sum_{j=1}^{m} w_{ij}$. Next, we compute the eigenvectors $v_1, \ldots, v_p$ corresponding to the p smallest eigenvalues of L and form a matrix V with these eigenvectors as columns. Note that a normalization step for each row of V is necessary, i.e., $v_{ij} \leftarrow v_{ij} / \left(\sum_k v_{ik}^2\right)^{1/2}$. The clustering result is then obtained by applying the K-means algorithm to the rows of V.
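A compact sketch of this classical pipeline, assuming the similarity matrix W has already been built (for instance with the sketch above); the helper name `spectral_clustering` is ours:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, p):
    """Normalized spectral clustering on a precomputed similarity matrix W."""
    d = W.sum(axis=1)                                 # vertex degrees d_i
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated points
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = eigh(L)                                 # eigenvalues in ascending order
    V = vecs[:, :p]                                   # first p eigenvectors as columns
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)  # row normalize
    return KMeans(n_clusters=p, n_init=10).fit_predict(V)
```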

Proposed method
In this section, we present a novel approach to constructing the similarity matrix for spectral clustering and analyze its computational complexity. The approach is based on the neighbor relations between points and a Gaussian kernel function with only one parameter. In the following subsections, we introduce the main content of the SC-NR algorithm.

Constructing the similarity matrix
In Section 2, we described the most common similarity function, the Gaussian kernel function. It measures the similarity between data points and divides them into several groups, but it has the disadvantage that the kernel parameter is fixed: once $\sigma$ is determined, the similarity between two points depends only on their Euclidean distance. Yet the Euclidean distance cannot reveal the true structure of the data. Suppose some pairs of points from different clusters have the same Euclidean distance between them; then those pairs have the same similarity. Zelnik-Manor et al. (Zelnik-Manor & Perona, 2004) proposed a local scale similarity measure, in which $\sigma_i$ is set to the distance between point $x_i$ and its k-th nearest neighbor. The similarity function is defined as:

$$w^G_{ij} = \begin{cases} \exp\left(-\dfrac{dis(x_i, x_j)^2}{\sigma_i \sigma_j}\right), & x_j \in kNN(x_i), \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$

where $x_j \in kNN(x_i)$ denotes that $x_j$ is a k-nearest neighbor of $x_i$. The value of $\sigma$ is self-tuning and varies with the surroundings, so there is no need to adjust parameters manually, avoiding differences in results caused by subjective factors. However, some problems remain.
Figure 1 shows the clustering result of the self-tuning algorithm (Zelnik-Manor & Perona, 2004) on the Twomoons dataset, where k is the value that reaches the highest NMI in the range of 5 to 100. In Figure 1, the distances between points a, b and a, c are equal and $\sigma_b < \sigma_c$, so using the Gaussian kernel function to calculate the similarity yields $w^G_{ab} < w^G_{ac}$. However, points a and b should in fact belong to the same cluster. The value of $\sigma$ in the adaptive algorithm depends not only on the distance between two points but also on the number of nearest neighbors k; when k becomes large, the k-th nearest neighbor of $x_i$ may not be in the same cluster as $x_i$. Thus the self-tuning algorithm fails on the Twomoons dataset.
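For reference, a sketch of the self-tuning affinity under these definitions; for brevity it omits the kNN sparsification of Eq. (2), and the helper name is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def self_tuning_similarity(X, k):
    """Self-tuning affinity (Zelnik-Manor & Perona, 2004):
    sigma_i = distance from x_i to its k-th nearest neighbor."""
    D = squareform(pdist(X))
    sigma = np.sort(D, axis=1)[:, k]            # index 0 is the point itself
    W = np.exp(-D**2 / np.outer(sigma, sigma))  # w_ij = exp(-d_ij^2 / (sigma_i sigma_j))
    np.fill_diagonal(W, 0.0)
    return W
```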

Constructing the similarity matrix based on neighbor relation
The core of the spectral clustering algorithm is the construction of the similarity graph. A proper similarity graph accurately reflects the degree of similarity of the data and yields better clustering results, and the basis of similarity graph construction is the similarity function. In this paper, we propose a new similarity function based on the neighbor relations between points:

$$w^N_{ij} = \begin{cases} \exp\left(-\dfrac{\left(dis(x_i, x_j)\,(q_{i,j} + q_{j,i})\right)^2}{\sigma_i \sigma_j}\right), & dis(x_i, x_j) \le \varepsilon, \\ 0, & dis(x_i, x_j) > \varepsilon, \end{cases} \tag{3}$$

where $q_{i,j}$ indicates that point $x_j$ is the $q_{i,j}$-th nearest neighbor of point $x_i$ and, correspondingly, $q_{j,i}$ indicates that point $x_i$ is the $q_{j,i}$-th nearest neighbor of point $x_j$. $\sigma_i$ is set to the average distance from point $x_i$ to its k-nearest neighbors, and $\varepsilon$ is set in the same way as in (Alshammari, Stavrakakis, & Takatsuka, 2021). The proposed neighbor-based similarity function has the following properties.
(1) We construct an ε-neighborhood graph, which can adjust the number of nearest neighbors for each point.
(2) The similarity matrix $W^N$ is sparse: two points are connected by an edge, and the corresponding matrix element is non-zero, only when the distance between them is not greater than ε; that is, $w^N_{ij} = 0$ when $dis(x_i, x_j) > \varepsilon$.
(3) $\sigma_i$ is the average distance between point $x_i$ and its k-nearest neighbors, which reflects the local information of the data and is insensitive to the value of k. Suppose point $x_i$ is located in a dense cluster and $kNN(x_i) = \{x_1, \ldots, x_k\}$ is the set of its k-nearest neighbors, where point $x_k$ lies in another, sparse cluster and the remaining points are in the same cluster as $x_i$; then the mean distance is barely distorted by the single distant neighbor $x_k$ and better distinguishes data from different clusters.
In the k-nearest neighbor graph, k is usually given globally and is fixed, so some points from different clusters may be connected. Therefore, we choose to construct an ε-neighborhood graph, where the value of ε is related to the distribution of the surrounding data. From Figure 2, we can see that each point has a different ε radius and the points inside a circle are data in the same cluster. Even if a circle contains points from different clusters, the term $(q_{i,j} + q_{j,i})$ increases the distance between points in different clusters, so that the weight between points in the same cluster is greater than that between points in different clusters. Suppose there are two clusters A and B, points $x_i$, $x_j$ belong to A, and point $x_h$ belongs to B. Then $dis(x_i, x_j) < dis(x_i, x_h)$, which means point $x_j$ is closer to point $x_i$ than point $x_h$ is. Since $q_{i,j} < q_{i,h}$ and $q_{j,i} < q_{h,i}$, we have $dis(x_i, x_j)(q_{i,j} + q_{j,i}) < dis(x_i, x_h)(q_{i,h} + q_{h,i})$, which means the weighted distance between points $x_i$ and $x_h$ becomes larger.
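A minimal NumPy sketch of Eq. (3) as reconstructed above might look as follows; the helper name, the rank convention (the nearest neighbor has rank 1), and the exact placement of the square in the exponent are our assumptions, so treat this as illustrative rather than the authors' reference implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sc_nr_similarity(X, k, eps):
    """Sketch of the neighbor-relation similarity of Eq. (3)."""
    D = squareform(pdist(X))
    # q[i, j]: rank of x_j among the neighbors of x_i (nearest neighbor has rank 1)
    q = np.argsort(np.argsort(D, axis=1), axis=1)
    # sigma_i: average distance from x_i to its k nearest neighbors (skip self at index 0)
    sigma = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    Dw = D * (q + q.T)                           # weighted Euclidean distance
    W = np.exp(-Dw**2 / np.outer(sigma, sigma))
    W[D > eps] = 0.0                             # epsilon-neighborhood sparsification
    np.fill_diagonal(W, 0.0)
    return W
```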
In addition, Von Luxburg (Von Luxburg, 2007) discussed that, in the ideal situation, the similarity matrix ought to be block diagonal; that is, a block diagonal similarity matrix is a necessary condition for the clusters to be completely separated. Figure 3 illustrates the visualization of the similarity matrix for the Soybean dataset, where warmer pixel colors correspond to higher similarity between data points and cooler pixel colors to lower similarity. The similarity matrix of SC-NR is visibly closer to the desired block diagonal matrix, satisfying the necessary condition for successful cluster separation.
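A visualization of this kind can be produced with a few lines of matplotlib, assuming W is a similarity matrix (e.g., from the sketch above) whose rows and columns are ordered by true class:

```python
import matplotlib.pyplot as plt

# Heat map of a similarity matrix; with rows/columns grouped by class,
# a good matrix shows warm blocks on the diagonal and cool areas elsewhere.
plt.imshow(W, cmap="jet")
plt.colorbar(label="similarity")
plt.show()
```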

Clustering based on the similarity matrix
After constructing the similarity matrix, we build the normalized Laplacian $L = I - D^{-1/2} W^N D^{-1/2}$, as in the traditional spectral clustering algorithm, and obtain the result by applying the K-means algorithm to the first p eigenvectors of the Laplacian matrix.
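Putting the pieces together, the full SC-NR pipeline reduces to two calls to the hypothetical helpers sketched earlier (`sc_nr_similarity` and `spectral_clustering`); the parameter values below are arbitrary examples:

```python
# X: (m, n) data matrix; k, eps, and p are chosen purely for illustration.
W = sc_nr_similarity(X, k=10, eps=1.5)
labels = spectral_clustering(W, p=2)
```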

Process and complexity of the algorithm
The detailed procedure of the proposed clustering algorithm is described in Algorithm 1.
The complexity of constructing the kNN graph is $O(m^2 \log m + nk)$, where n is the number of pairs of data points with weight less than ε. The complexity of the affinity matrix construction is $O(m^2)$. Computing the eigenvectors of the normalized Laplacian matrix and normalizing them entails a complexity of $O(2p^2 + mp)$. The final step of the algorithm performs clustering using K-means with a complexity of $O(mp^2 t)$, where t is the number of iterations. Consequently, the overall complexity of our SC-NR algorithm is $O(m^2 \log m + nk + m^2 + 2p^2 + mp + mp^2 t)$.

Experimental results
In this section, we present the datasets used in the experiments, the evaluation metrics, and the experimental results. We compare not only with the classical K-means and spectral clustering algorithms but also with several state-of-the-art spectral clustering algorithms. All experiments are conducted in MATLAB 2018b on a PC with an AMD Ryzen 3600 CPU and 16 GB of RAM.

Datasets
Our SC-NR algorithm is evaluated on six synthetic datasets (shown in Figure 4) and several real-world datasets. Two of the synthetic datasets, SF2000 and Flower2000, were obtained from (Huang, Wang, Wu, Lai, & Kwoh, 2019).
There are two sources of real-world data: the feature selection repository (J. Li et al., 2018), which contains the biological dataset Lung and the face dataset ORL, and the machine learning repository of the University of California, Irvine (Asuncion & Newman, 2007). The properties of the synthetic and real-world datasets are shown in Tables 1 and 2, respectively.
Table 3: Comparison of experimental results on different datasets. Note: "/" denotes that no suitable parameters were found to run the algorithm on that dataset. The best result for each dataset is highlighted in red.

Evaluation metrics
We consider three performance indicators to assess the quality of the clusters: normalized mutual information (NMI), accuracy (ACC), and F-measure. The NMI criterion is calculated as (Strehl & Ghosh, 2002):

$$NMI = \frac{\sum_{j=1}^{p}\sum_{k=1}^{p} w_{jk} \log \dfrac{w_{jk}}{w_j \hat{w}_k}}{\sqrt{\left(\sum_{j=1}^{p} w_j \log w_j\right)\left(\sum_{k=1}^{p} \hat{w}_k \log \hat{w}_k\right)}},$$

where p and m denote the number of clusters and the number of sample points, respectively, $w_j$ denotes the probability of data points falling in the cluster $P_j$ obtained by the clustering algorithm, $\hat{w}_k$ is the probability of sample points belonging to the k-th class of the true clustering, and $w_{jk}$ represents the probability of sample points belonging to both $P_j$ and the k-th true class.
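A hedged NumPy sketch of this probability form, assuming integer label arrays (in practice, scikit-learn's `normalized_mutual_info_score` offers an off-the-shelf alternative):

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI in the probability form above (Strehl & Ghosh, 2002)."""
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    # w_jk: fraction of points in predicted cluster j and true class k
    w_jk = np.array([[np.mean((labels_pred == j) & (labels_true == c))
                      for c in classes] for j in clusters])
    w_j = w_jk.sum(axis=1)      # cluster marginals
    w_hat = w_jk.sum(axis=0)    # class marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = np.nansum(w_jk * np.log(w_jk / np.outer(w_j, w_hat)))
    return mi / np.sqrt(np.sum(w_j * np.log(w_j)) * np.sum(w_hat * np.log(w_hat)))
```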
The ACC criterion is calculated as (Cheng, Zhu, Huang, Wu, & Yang, 2019; Du, Ding, & Jia, 2016):

$$ACC = \frac{\sum_{i=1}^{m} \zeta\left(l_i,\, map(p_i)\right)}{m},$$

where $l_i$ is the true class label of point $x_i$, $p_i$ is the acquired cluster label of point $x_i$, and $\zeta(x, y) = 1$ if $x = y$ and $0$ otherwise.
The function $map(\cdot)$ matches the real class labels and the predicted cluster labels.
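A sketch of ACC in which $map(\cdot)$ is realized with the Hungarian algorithm (a standard choice, though the paper does not specify the implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """ACC with map(.) realized by the Hungarian algorithm."""
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    # cost[j, c]: number of points placed in cluster j whose true class is c
    cost = np.array([[np.sum((labels_pred == j) & (labels_true == c))
                      for c in classes] for j in clusters])
    row, col = linear_sum_assignment(-cost)  # negate to maximize matched points
    return cost[row, col].sum() / len(labels_true)
```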
The F-measure criterion is calculated as (Chen et al., 2018; Xie, Xiong, Zhang, Feng, & Ma, 2018):

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$

where Precision is the proportion of pairs of data points placed in the same group that actually belong to the same group, and Recall is the probability that pairs of data points actually in the same group are placed in the same group. NMI, ACC and F-measure all range from 0 to 1, and values closer to 1 indicate better clustering quality.
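A sketch of this pairwise precision/recall computation, assuming integer label arrays; the pair-counting convention is our assumption, as the paper does not spell it out:

```python
import numpy as np

def pairwise_f_measure(labels_true, labels_pred):
    """F-measure over unordered pairs of points."""
    iu = np.triu_indices(len(labels_true), k=1)  # count each pair once
    same_true = (labels_true[:, None] == labels_true[None, :])[iu]
    same_pred = (labels_pred[:, None] == labels_pred[None, :])[iu]
    tp = np.sum(same_true & same_pred)           # pairs grouped together in both
    precision = tp / max(np.sum(same_pred), 1)
    recall = tp / max(np.sum(same_true), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```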
In the experiments, the only parameter involved is the number of nearest neighbors k, which ranges from 5 to 100. We ran the K-means algorithm on the indicator matrix 50 times and chose the best result. Table 3 shows the quantitative results of the 8 clustering algorithms under the three metrics ACC, F-measure, and NMI on 20 datasets.
For the Soybean dataset and 3 of the synthetic datasets, the proposed SC-NR algorithm achieves the best results in all three metrics, reaching a perfect score of 1 where other clustering algorithms do not. Additionally, for the Iris, Banknote, Pageblock, twospiral, and Flower2000 datasets, SC-NR obtains outstanding performance on all three evaluation criteria as well.
As we can see in Table 3, SC-NR scores best in two metrics on a total of 6 datasets. Among them, the Liver disorders, ORL, Ecoli, and SF2000 datasets reach the first rank in the ACC and NMI metrics, and the Zoo and PenDigits datasets obtain the best results in the F-measure and NMI metrics.
For the Lung and Contraceptive datasets, the SC-NR clustering algorithm performs slightly worse, reaching the optimum only in accuracy. Nevertheless, the Lung dataset ranks third in the ACC and NMI metrics, and the Contraceptive dataset is only 0.0186 and 0.0047 below the optimal result in the F-measure and NMI metrics, respectively.
The above experimental results demonstrate that the proposed SC-NR algorithm has excellent clustering effects on datasets with different numbers of classes and different dimensions. SC-NR is highly competitive with other advanced approaches and scores well in all three criteria.

Conclusion
In this work, we define a neighbor relation similarity measure, called SC-NR, that improves the spectral clustering algorithm. The algorithm captures the local density between data points using the average distance from each data point to its k-nearest neighbors. Experimental results demonstrate that the performance of the SC-NR algorithm on synthetic and real-world datasets is highly competitive with state-of-the-art algorithms.

Fig. 2: The clustering result of the proposed SC-NR algorithm on the Twomoons dataset.

Algorithm 1
Spectral clustering based on neighbor relation (SC-NR)
Input: m data points $X = \{x_i \mid i = 1, \ldots, m\}$, number k of nearest neighbors, number p of clusters to construct.
Output: Clusters $C_1, \ldots, C_p$.
Step 1. Compute the similarity $w^N_{ij}$ of graph G between each pair of points $(x_i, x_j)$ according to Equation (3) and construct the adjacency matrix $W^N$.
Step 2. Compute the normalized Laplacian matrix $L = I - D^{-1/2} W^N D^{-1/2}$.
Step 3. Compute the first p eigenvectors $v_1, \ldots, v_p$ of L and form the matrix V with them as columns.
Step 4. Normalize each row of V to unit length.
Step 5. Considering each row of V as a point in $\mathbb{R}^p$, partition the rows into p clusters $C_1, \ldots, C_p$ with K-means.

Fig. 4: Six synthetic datasets applied in the experiments. Different colors represent different cluster classes, and the number of clusters varies from 2 to 13.

Table 1: Properties of the synthetic datasets.

Table 2: Properties of the real-world datasets.