DDCM: a decentralized density clustering method and its results gathering approach

The use of distributed clustering is an important method for solving large-scale data mining problems. However, distributed clustering still suffers from problems such as the performance bottleneck on the master node and the network congestion caused by global broadcasting. This paper proposes a decentralized clustering method based on density clustering and the content-addressable network technique. It forms clusters from several surrounding nodes and offers excellent scalability and load balancing. In addition, a method is presented for optimizing the way clustering results are gathered in different application scenarios. In our extensive experiments, the proposed approach performs three times better than the benchmark algorithm in terms of efficiency and maintains a stable expanding ratio of about 0.6 on large-scale data sets.


Introduction
As information technology advances, massive amounts of data and information about human lives and production can be collected and stored. Due to the increasing amount of data stored in databases, cluster analysis of extremely large data sets has become increasingly important; it is among the most important topics in data mining research. For extremely large-scale data, clustering algorithms [1-3] face issues of scalability and efficiency. Distributed clustering is a highly effective way to improve clustering efficiency [4, 5].
In a distributed computing environment, distributed clustering abstracts classification patterns from large-scale data sets. As a general rule, it is based on the following concept [6]. First, clustering is performed on the local nodes. The local nodes then send their clustering results or parameters to the master node, which clusters these data globally in order to obtain global clustering models. Finally, upon receiving these models from the master node, each node updates its clusters in accordance with the global models.
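Schematically, this master-node pattern looks as follows; local_cluster, global_cluster, and update are placeholder names for the three phases, not an interface from [6]:

def master_node_clustering(local_nodes, master):
    # Phase 1: each local node clusters its own data.
    local_models = [node.local_cluster() for node in local_nodes]
    # Phase 2: the master node clusters the local results globally.
    global_models = master.global_cluster(local_models)
    # Phase 3: every node updates its clusters to match the global models.
    for node in local_nodes:
        node.update(global_models)
    return global_models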
The distributed clustering process requires all nodes to gather and broadcast information. When the number of distributed nodes is large, the network traffic at the master node is likely to increase substantially; in that case, distributed clustering may even require more time than centralized clustering.
Decentralized architecture is an important feature of next-generation computing, which will have to deal with massive amounts of data, a large number of computing nodes, and various computing tasks. In the case of smart cities, for example, data processing should adopt a decentralized architecture in order to analyze all the data efficiently. Decentralized architecture is therefore the direction in which clustering analysis will move in the future.
A decentralized architecture with high autonomy at each node can alleviate the load on the master node in distributed clustering. In a decentralized system, there is no longer a need for a master node, and each node completes its computing tasks in an equal and cooperative manner. Introducing the decentralized idea into the clustering of massive data thus makes it possible to balance the load on each node and improve computing efficiency. In terms of architecture, peer-to-peer networks are considered a classical decentralized model. A number of algorithms, including DBDC [7] and k-DmeansWM [8], attempt to complete distributed clustering by utilizing peer-to-peer networks. Their clustering computations, however, still require global information, which is obtained by broadcasting information widely or by establishing a master node to share it. Clustering based on peer-to-peer networks therefore also suffers from load imbalance or network overload.
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm [9] generates clusters based only on local information and does not require global parameters, which makes it suitable for decentralized environments. The Content Addressable Network (CAN) [10] is a technique for decentralized data distribution: multidimensional data sets are balanced across the nodes, and similar data are placed in close proximity to one another. By combining these two techniques, we propose the Decentralized Density Clustering Method (DDCM). The main concept is as follows: first, clustering objects are distributed across the CAN space in accordance with their attributes; second, the nodes cooperatively finish clustering using the DBSCAN algorithm in the CAN space. A detailed discussion of these two steps is provided in Sect. 3 and Sect. 4.
The DDCM is based on the decentralized principle. Clustering nodes do not rely on global information and create clusters by cooperating with adjacent nodes. As a result, DDCM eliminates the performance bottleneck, and the scale of the system is no longer limited by the capacity of any single node; as the number of nodes increases, the number of clustering objects can increase linearly.
Completing distributed clustering of a data set also involves the issue of gathering the clustering results, and different applications place different demands on this step: anomaly detection needs the abnormal samples, while classification needs the generated clusters. In a distributed environment with heavy network traffic, result gathering is an integral part of the process, and there is still a great deal of room for optimizing it for different requirements. Based on the characteristics of DDCM, Sect. 5 presents methods for gathering results in different application scenarios.
Our research makes the following major contributions: (1) we propose a decentralized clustering algorithm that achieves load balancing, reduces network load, and is scalable; (2) we analyze application scenarios related to gathering results in distributed clustering and design different optimization schemes, further enhancing the application efficiency of distributed clustering.

Related work
Distributed clustering is necessary in order to improve the time efficiency of clustering very large data sets [4]. Academic research on distributed clustering can be grouped into three categories. The first is MapReduce-based distributed clustering [11-15]. With this approach, classical clustering algorithms are adapted to MapReduce, and calculations are completed in parallel. However, the MapReduce framework is rigid: a clustering calculation follows a fixed pattern of a Map operation followed by a Reduce operation, which makes it difficult to adapt flexibly to data characteristics and clustering requirements and is likely to lead to data skew and slow convergence.
The second is distributed clustering with master nodes [6, 7, 16-21]. In this approach, a master node manages the nodes, gathers intermediate results or parameters, and broadcasts them [7]. Furthermore, the accuracy rate is positively correlated with the amount of data transferred. These factors all contribute to a performance bottleneck at the master node, especially in its network traffic.
The third is peer-to-peer distributed clustering [8, 22-26]. A peer-to-peer protocol is used to organize nodes and distribute data; no master node exists, and each node is equivalent. When forming clusters, however, all nodes must exchange information globally, which heavily loads the network. For example, after local clustering the nodes broadcast their own information and gather the information of all the other nodes [8], which generates significant traffic. In Li et al. [26], nodes are organized in a tree structure, and each node sends information to the root node; the root node effectively becomes a master node, which creates a performance bottleneck.
Accordingly, distributed clustering still leaves room for optimizing load balancing and network traffic. In this paper we propose the DDCM algorithm, which finishes clustering in a decentralized network using only local information, so that network traffic is reduced and time efficiency is improved.
The peer-to-peer network is a common decentralized structure, and its mature techniques include CAN [10], BATON [27], Pastry [28], and Chord [29]. By using a multi-dimensional coordinate space, CAN constructs an overlay network that is well suited to multidimensional data sets in a decentralized environment.

Distribution of clustering objects over decentralized nodes
Clustering objects must be distributed over the computing nodes before the clustering computation can take place. This paper utilizes CAN to organize the distribution; CAN is based on a virtual space with d-dimensional Cartesian coordinates. The whole space is assigned to the nodes, and each node is responsible for maintaining its own disjoint region. The nodes of CAN self-organize into an overlay network representing the virtual coordinate space. Each node maintains the network addresses of its neighboring nodes and generates its own routing table, from which the path between any two nodes in the space can be determined.
The dimensionality of the CAN space should equal the number of attributes of the clustering objects. The multidimensional attributes of each object are mapped to a unique point in CAN space; these points are collectively referred to as data points. Each node has its own region in CAN space, and when a data point falls into the region of a node, that node becomes responsible for the data point. For load balancing, the number of data points managed by each node should be roughly the same, so that every node's computing ability is exploited and the time efficiency of clustering is improved.
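As an illustration of this mapping, consider the following sketch. The min-max normalization and the Region class are our own assumptions about how attributes could be mapped into a unit CAN space, not the paper's specification:

# Minimal sketch: map clustering objects into CAN space and find the
# responsible node. Regions are disjoint per-dimension intervals.

class Region:
    """A node's responsible zone: one [lo, hi) interval per dimension."""
    def __init__(self, intervals):
        self.intervals = intervals  # list of (lo, hi) tuples

    def contains(self, point):
        return all(lo <= x < hi for x, (lo, hi) in zip(point, self.intervals))

def to_can_point(attributes, mins, maxs):
    """Map an object's attribute vector to a point in the d-dimensional
    unit CAN space by min-max normalization (an assumed scheme)."""
    return tuple((a - lo) / (hi - lo) for a, lo, hi in zip(attributes, mins, maxs))

def responsible_node(point, nodes):
    """Find the node whose region contains the data point."""
    for node_id, region in nodes.items():
        if region.contains(point):
            return node_id
    raise ValueError("CAN regions must cover the whole space")

# Example: a 2-D CAN space split between two nodes along the first axis.
nodes = {
    "n1": Region([(0.0, 0.5), (0.0, 1.0)]),
    "n2": Region([(0.5, 1.0), (0.0, 1.0)]),
}
p = to_can_point([7.0, 30.0], mins=[0.0, 0.0], maxs=[10.0, 100.0])  # (0.7, 0.3)
print(responsible_node(p, nodes))  # -> n2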
Assume the data points have two-dimensional attributes. As shown in Fig. 1, the data points are mapped to nodes in a two-dimensional CAN space. Data points are denoted by dots: a blue point represents a normal point, while a yellow point represents an abnormal point. Each grid cell represents a computing node. Each node manages a portion of the data points, and both normal and abnormal points may reside on a local node.

Distributed density clustering in CAN space
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a high-performance spatial clustering algorithm. Its fundamental idea is to put p and q in the same cluster if they are density-connected [9]. The set of all data points within the circular region of radius ε around a data point p is referred to as the ε neighborhood of p, denoted as N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}, where D represents the set of data points and dist(p, q) represents the distance between p and q.
For a given parameter MinPts, if the ε neighborhood of a data point p in D contains at least MinPts points, p is referred to as a core point; otherwise, it is a non-core point. For two points p and q in D, if q is in the ε neighborhood of p and p is a core point, q is directly density-reachable from p. Further, if q is a core point and r is in the ε neighborhood of q, then r is density-reachable from p; density-reachability can thus spread through chains of core points.
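These definitions translate directly into code. A minimal sketch, assuming Euclidean distance and points given as coordinate tuples:

import math

def dist(p, q):
    """Euclidean distance between two points (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def eps_neighborhood(p, D, eps):
    """N_eps(p) = {q in D | dist(p, q) <= eps}."""
    return [q for q in D if dist(p, q) <= eps]

def is_core_point(p, D, eps, min_pts):
    """p is a core point if its eps neighborhood has at least MinPts points."""
    return len(eps_neighborhood(p, D, eps)) >= min_pts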
DBSCAN can be used in CAN space because the relative positions of the clustering objects are preserved there. To cluster the data points, DBSCAN only computes over adjacent ones; no global information is needed, which makes it suitable for decentralized and distributed environments. DDCM adapts DBSCAN to such an environment. The following definitions are provided for the purpose of description.
Definition 1 (Node) A node is a computing node organized according to the CAN protocol; the term is also used for the local node. The nodes are peer-to-peer and completely decentralized, and the logic and function of every node are identical. To simplify the model, we assume that each node has the same computing performance, so each node should compute a roughly equal number of data points.
Neighbor node: for two nodes A and B, if the responsible space of B borders the responsible space of A, then A and B are neighbors.
Adjacent node: if node B's space is adjacent to node A's space, B is an adjacent node of A. A's neighbor nodes are included in its set of adjacent nodes.
Definition 2 (Data point) A data point is the point in CAN space onto which a clustering object is mapped according to its attributes; for short, it is referred to as a point.
Local data point: under the CAN protocol, a local node manages a portion of the data points in D; these are referred to as its local data points, and the node hosting them is called their hosting node.
After computing N_ε(p), DBSCAN executes the density clustering procedure, which can be described as follows. First, select an unlabeled core point to create a cluster. Second, find all the points density-reachable from that core point under the parameters ε and MinPts, and add them to the cluster. Third, repeat the above steps until all points have been processed or no unlabeled core points remain.
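For reference, here is a compact sketch of this centralized procedure, which DDCM runs as the local clustering step; it is a textbook implementation assuming Euclidean distance, not the paper's exact code:

import math

def dbscan(D, eps, min_pts):
    """Textbook DBSCAN over a list of coordinate tuples D.
    Returns one label per point; -1 marks noise."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    def region_query(i):
        return [j for j in range(len(D)) if dist(D[i], D[j]) <= eps]
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(D)
    cluster_id = 0
    for i in range(len(D)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster_id           # i is a core point: start a cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: expand
        cluster_id += 1
    return labels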
The data points in CAN are distributed among the local nodes, so N_ε(p) and the density clustering procedure must be computed differently than in centralized DBSCAN. Next, we discuss the computation method for N_ε(p) and the adaptive improvement of the clustering algorithm.

N_ε(p) computation on the local node
Clustering involves two types of data points, which we call interior points and edge points. For node n10 in Fig. 2, the blue points are interior points. Since the ε neighborhood of an interior point lies entirely within the space charged to the local node, the N_ε(p) computation for an interior point can be completed on the local node. The yellow points in Fig. 2 represent edge points. The N_ε(p) computation for an edge point requires information from its neighbor nodes, since part of its ε neighborhood lies on those nodes.
For the yellow points of node n10 in Fig. 2, there are three neighboring nodes within their ε radius, so n10 needs to exchange information with three nodes to compute N_ε(p). In the CAN protocol, the size of the space charged to each node differs for the sake of load balancing, so the ε radius of an edge point may have a complex distribution over the adjacent nodes. Figure 3 illustrates a node distribution in a two-dimensional CAN protocol: the nodes have spaces of different sizes, and their distributions are uncertain. As shown in Fig. 3, the N_ε(p) computations of the yellow points require the assistance of the shaded nodes. Computing N_ε(p) in this way requires complex distributed scheduling and a great deal of communication between nodes; with a large number of edge points on a local node, the network load would be heavy.
To address these problems, we developed a Distributed method of Computing N_ε(p) (DCN) that completes the N_ε(p) computation accurately and quickly in CAN. The main idea of this solution is that information about adjacent nodes, namely the data points lying in the surrounding area of width ε around the local region (the peripheral points), is collected in advance by the local node, so that the N_ε(p) computation no longer depends on other nodes. Taking node n10 as an example, the collection steps (2) through (5) are repeated until the set G is empty (step (6)); node n10 then computes N_ε(p) from its local and peripheral points (step (7)).
Because steps (2) to (5) are repeated, all data points within the ε radius around each edge point are collected by n10. Therefore, the N_ε(p) computation of both interior points and edge points depends only on the local node and can be completed rapidly.
Within the distributed and rapid N_ε(p) computation method we designed, steps (2) to (5) are referred to as the peripheral point collection process. It is a distributed process that can be parallelized: during step (2), as many scattered points as possible can be generated and sent to multiple nodes, and during step (3) multiple nodes can execute in parallel.
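As a loose sketch of the peripheral point collection idea under our reading of it (the surrounding_area helper and the request_points call are our own assumptions, not DCN's actual interface):

def surrounding_area(region, eps):
    """Expand the local region by eps in every dimension (cf. Fig. 4)."""
    return [(lo - eps, hi + eps) for lo, hi in region]

def collect_peripheral_points(local_region, adjacent_nodes, eps):
    """Gather, in advance, every data point that adjacent nodes hold
    inside the eps-wide band around the local region, so that N_eps(p)
    can later be computed entirely on the local node."""
    band = surrounding_area(local_region, eps)
    peripheral = []
    for node in adjacent_nodes:      # the requests can run in parallel
        # Hypothetical RPC: the adjacent node returns its local data
        # points falling inside the queried area.
        peripheral.extend(node.request_points(band))
    return peripheral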
The DCN method improves the efficiency of computing N_ε(p). In DCN, the computation of N_ε(p) is based on the local node, and the process only needs to interact with neighboring nodes; the interaction complexity is O(N_a), where N_a is the number of neighbor nodes. An intuitive algorithm would instead interact with all the nodes involved in the ε neighborhoods; assuming these nodes number P_a, the interaction complexity is O(P_a). In general, P_a ≫ N_a holds, so DCN is clearly superior in terms of performance.

Clustering across nodes
The density clustering process begins after the computation of N_ε(p) has been completed. The process involves two steps. First, the local nodes separately execute the DBSCAN algorithm on their local data points, forming several clusters. Second, if there are edge points, clusters on different nodes may need to be merged; as can be seen in Fig. 2, the yellow points should merge with the clusters of other nodes. Clusters that have been merged across nodes are referred to as cross-node clusters. This section discusses the merging method for cross-node clusters.
The classic DBSCAN algorithm is used for local clustering. Clustering runs over the interior, edge, and peripheral points of a local node and generates many clusters, which fall into two types: internal clusters and cross-node clusters. In an internal cluster, such as cluster C0, all points are interior or edge points, and there are no peripheral points; an internal cluster does not merge with clusters of other nodes.
Cross-node clusters contain peripheral points, that is, data points from other nodes that need to be added to the cluster. A cross-node cluster lies along the border of a local node, and two or more clusters may merge. We identify two cases of merging for cross-node clusters. In the first case, a peripheral point in a cluster is a core point on another node; the other node then has a cluster that includes this core point, and the clusters of the two adjacent nodes may need to be merged. For example, the yellow point in cluster C1 in Fig. 5 is both a core point and a peripheral point; due to density-reachability, C1 of n10 and C2 of n11 can be combined into one cluster. In the second case, a peripheral point within a cluster is a non-core point on another node. A non-core point cannot form a cluster on the neighboring node, so there is no need to merge clusters; the non-core point is simply added to the cluster. Accordingly, the orange point in Fig. 5 is a peripheral point of n10 but not a core point of n11, and it is directly incorporated into C1.
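The case analysis reduces to a small decision made by the receiving node for each peripheral point; the function and argument names below are illustrative assumptions, not the paper's API:

def handle_peripheral_point(q, c1, local_label, local_core_points):
    """q: a peripheral point belonging to the neighbor's cluster c1.
    local_label: dict point -> local cluster id.
    local_core_points: set of points that are core points locally."""
    if q in local_core_points:
        # Case 1: q is a core point here, so the local cluster containing
        # q and c1 are density-connected and must be merged into one.
        return ("merge", local_label[q], c1)
    # Case 2: q is a non-core point here; it cannot extend any local
    # cluster, so it is simply added to c1 as a member.
    return ("add", q, c1)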
The merging algorithm for cross-node clusters requires the cooperation of two nodes and involves two child processes. One is the discovery process, which transmits information about the cross-node clusters on the node to its neighbor nodes. The other is the merge process, which executes the merging after receiving a merge request from a discovery process.
It is assumed that the local node is denoted as n10. A concrete algorithmic description of the discovery process is provided in Algorithm 1.
Each node constantly monitors messages from its neighboring nodes. Upon receiving the message that the point set D is classified as C1 from the discovery process of a neighbor node, the merge process is initiated. Let n11 be the local node; an algorithmic description of the merge process is provided in Algorithm 2. When merging cross-node clusters, it is necessary to merge all the clusters that each core point involves; Algorithm 2 guarantees this requirement, without which the merging operation cannot be completed correctly. For example, for node n11 in Fig. 6, the clusters C2 and C3 are distinct after local clustering. When nodes n10 and n11 perform cross-node clustering, both C2 and C3 merge with C1; therefore, C2 and C3 become the same cluster after cross-node clustering, and the cluster is denoted C1.
On the local node, the metadata of cross-node clusters comprises multiple tuples denoted ⟨C_j, N_j⟩, where each tuple represents one cross-node cluster of the node: j is the identifier of the cross-node cluster, C_j is the set of its data points on the node, and N_j is the set of hosting nodes over which the cross-node cluster is distributed. When two cross-node clusters hosted on two nodes are merged, the information can quickly spread among all the nodes the clusters involve via N_j.
In the merge function Merge(C1, b_i.C_j | b_i in n11), merging two clusters is implemented by changing the identifiers of both clusters to the same one. Sometimes a cluster spans more than two nodes; in this case, all the nodes that host the cluster should be informed and should update their respective identifiers. In Fig. 6, assume that nodes n9, n10, and n11 have completed the merging of their cross-node clusters, with identifier j being C1 and N_1 = {n9, n10, n11}; moreover, n12 and n13 have also completed merging, with identifier j being C4. When the clusters of n11 and n13 merge, the metadata of all the nodes the cluster involves should be updated, i.e., nodes n9, n10, n11, n12, and n13: the identifier on these nodes is changed to C1, and N_1 = {n9, n10, n11, n12, n13}.
Each node in a decentralized environment is heterogeneous. If the local node n11 receives merge requests for a cluster from its neighbors but has not yet completed its own clustering, n11 should first wait for its clustering to complete before executing the merge process. Waiting does not hurt the efficiency of clustering. First, the load of the nodes is relatively balanced, so the waiting time will not be excessive. Second, because the local node is busy performing clustering and has no spare computing power for merging, the waiting process does not waste its computing capacity.
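A minimal sketch of this metadata and the identifier update, assuming a hypothetical notify message for informing hosting nodes:

class CrossNodeMeta:
    """Per-node metadata: one <C_j, N_j> entry per cross-node cluster."""
    def __init__(self):
        self.points = {}   # cluster id -> set of local data points (C_j)
        self.hosts = {}    # cluster id -> set of hosting node ids (N_j)

    def merge(self, keep_id, drop_id, notify):
        """Relabel drop_id as keep_id and inform every hosting node of
        both clusters so their metadata stays consistent."""
        self.points.setdefault(keep_id, set()).update(self.points.pop(drop_id, set()))
        merged_hosts = self.hosts.pop(drop_id, set()) | self.hosts.get(keep_id, set())
        self.hosts[keep_id] = merged_hosts
        for node in merged_hosts:
            notify(node, drop_id, keep_id)  # hypothetical id-update message

# Example mirroring Fig. 6: after merging, N_1 covers all five nodes.
meta = CrossNodeMeta()
meta.points = {"C1": {(0.1, 0.2)}, "C4": {(0.8, 0.9)}}
meta.hosts = {"C1": {"n9", "n10", "n11"}, "C4": {"n12", "n13"}}
meta.merge("C1", "C4", notify=lambda node, old, new: print(node, old, "->", new))
# meta.hosts["C1"] is now {"n9", "n10", "n11", "n12", "n13"}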

Results gathering
The clustering results must be gathered after clustering has been completed. The node that receives the clustering results is referred to as the integration node. Obtaining information from many nodes may place a significant load on the network. For distributed clustering of large-scale data, different demands exist for gathering results, and gathering efficiency can be optimized depending on the application requirements. Next, we discuss three types of requirements for gathering results.
The first is to gather all the clustering results. Similar to centralized clustering, it gathers information about all clusters and outliers. This demand requires collecting information from all the nodes, so the first case degenerates into clustering with a central node, and there is little room for optimization.
The second is to gather information about the clusters only, which suits most clustering applications. For example, unsupervised learning requires only the information about the clusters and their members. This demand calls for a rapid gathering method that reduces the load on the network, which is discussed in Sect. 5.1.
The third is to gather a minimum or maximum value, such as the largest cluster or the most abnormal data point. Gathering all the data first and then comparing could cause severe performance bottlenecks; performing the comparison on the local nodes eliminates the bottleneck. A quick query approach for determining the maximum and minimum values of clustering results is proposed in Sect. 5.2.

Cluster collection
Clusters can be collected in two different ways. In the first method, each node sends its own clusters, both internal and cross-node, to the integration node. This method is applicable to a small number of nodes; a large number of nodes would place a high network load on the integration node. In the second method, information about the hosting nodes is sent in units of clusters: internal clusters send their information directly, while a cross-node cluster sends its information to a relay node, which then reports to the integration node. This method is applicable when there are few clusters; the network load on the integration node clearly decreases when the number of clusters is much smaller than the number of nodes.
We do not further discuss the first method, which is very simple. In the second approach, each cross-node cluster selects one node to report its information; this node is called the primary node of the cluster. Once the clustering of the cross-node cluster has been completed, the primary node reports to the integration node for the first time. Therefore, the primary node must be able to determine, in a timely manner, that all the hosting nodes of a cross-node cluster have completed.
According to the N_j value in ⟨C_j, N_j⟩, the primary node confirms completion with all the hosting nodes of the cluster. However, there is still an issue in determining whether the merge process has been completed: merging clusters is a dynamic process, and nodes are dynamically added to N_j while it runs.
At any given time, the primary node cannot know whether N_j already includes all the hosting nodes of the cluster. To resolve this issue, we propose a determination approach for merge completion. Using this method, it is possible to verify accurately and in a timely manner whether the merging of a cross-node cluster has been completed, i.e., whether N_j no longer changes. The main steps are as follows.
At some point, the metadata of the cluster on the primary node is ⟨C_j, N_j⟩ with n_i ∈ N_j. When both of the following requirements are met, the border of the cross-node cluster is identified and clustering is complete: for every n_i in N_j, (1) the cross-node cluster discovery process on n_i is complete, and (2) all the neighbor nodes that C_j involves on n_i have completed the merge process for C_j.
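A sketch of this completion test, assuming hypothetical status-query helpers for the two conditions:

def merge_complete(hosts, discovery_done, involved_neighbors, merge_done):
    """hosts: the current N_j.
    discovery_done(n): has node n finished its discovery process for C_j?
    involved_neighbors(n): the neighbor nodes C_j involves on n.
    merge_done(m): has node m finished the merge process for C_j?
    Returns True iff N_j can no longer change."""
    for n in hosts:
        if not discovery_done(n):
            return False
        if not all(merge_done(m) for m in involved_neighbors(n)):
            return False
    return True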

Gathering of maximum or minimum value
In some clustering applications, it is not necessary to gather all the results. Anomaly detection, for example, requires only the abnormal data, while behavior analysis requires only some of the largest clusters. Below, we present a method for determining the maximum and minimum values in DDCM.
(1) As shown in Fig. 7, the nodes construct a competition tree in the CAN division order, and each node submits its local maximum or minimum value to the competition tree. (2) The root node of the competition tree is the integration node, marked with a red stroke in Fig. 7; the value at the root node is the global maximum or minimum.
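A minimal sketch of the reduction along the competition tree; the tree layout here is an illustrative assumption, whereas in DDCM it follows the CAN division order:

def competition_tree_max(tree, node, local_max):
    """tree: node -> list of children; local_max: node -> local extreme.
    Each parent keeps the winner among itself and its subtrees, so the
    root ends up holding the global maximum."""
    best = local_max[node]
    for child in tree.get(node, []):
        best = max(best, competition_tree_max(tree, child, local_max))
    return best

tree = {"root": ["a", "b"], "a": ["c", "d"]}
local_max = {"root": 3, "a": 9, "b": 5, "c": 12, "d": 1}
print(competition_tree_max(tree, "root", local_max))  # -> 12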
In large-scale clustering scenarios, further requirements arise, such as determining whether two data points belong to the same cluster. In most cases, a rapid gathering method can be designed according to the distribution characteristics of cross-node clusters.

Other cases of results gathering
Beyond the cases above, there are other cases of results gathering, such as abnormal value gathering and assigned cluster gathering. Abnormal values can be gathered in two ways. The first is that each node determines its abnormal values and sends them directly to the integration node. The other is to compare the abnormal values on each node and gather only the most abnormal ones, which reduces to the case in Sect. 5.2. To gather assigned clusters, the integration node collects the information of all the primary nodes, determines the required clusters, and finds the corresponding primary nodes; the assigned clusters can then be gathered using the distribution information provided by the primary node.

Performance analysis and experimental demonstration
The performance of DDCM is analyzed in Sect. 6.1, and its efficiency and scalability are validated experimentally in Sects. 6.2 through 6.5.

Performance analysis
Based on a decentralized architecture, the DDCM algorithm completes clustering tasks. Let N be the number of data points and M the number of nodes. With load balancing, each node processes approximately N/M data points, so the time complexity of DBSCAN on a local node is O((N/M) log(N/M)). During the clustering process, the time complexity of network communication is O(C·K²), where C is the number of cross-node clusters and K is the average number of hosting nodes of a cross-node cluster. For each merge, all hosting nodes are informed; in extreme cases, merging may be executed K times, and the communication cost for generating a cross-node cluster may be as high as K². Overall, the computation and communication costs of the local nodes are low, and the performance bottleneck is eliminated.

Experimental setting
Using OpenStack, we create 8 to 64 virtual nodes. Each node runs the Ubuntu operating system and is equipped with one processor, 2 GB of memory, and a 50 GB disk. The clustering program is written in Python. The data are generated at random; their distribution is shown in Fig. 8. The amount of data is at most 3 M.
In a distributed algorithm, data must be sent to the nodes before the clustering computation takes place. When DDCM transmits data to CAN nodes, the nodes are divided according to a load-balancing rule; in extreme cases, M divisions may occur, which is time-consuming. In the experiments, two methods of initializing CAN are compared. The first is the classical method: each node's average load is calculated, and a node is divided when its load exceeds the average. This method is referred to as DDCM in the experiments. The second is an improved method: the nodes divide the CAN space equally, and once the data points are assigned, the following operation is executed repeatedly, i.e., if the difference between the maximum load and the minimum load is larger than a threshold, the node with the minimum load takes over part of the load of the node with the maximum load. This method is referred to as DDCM-A in the experiments.
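A rough sketch of the DDCM-A rebalancing loop, abstracting the actual hand-off of CAN space between nodes into pure load counts (the halving split is our own assumption about how much load is shared per step):

def rebalance(loads, threshold):
    """loads: node id -> number of assigned data points (mutated in place).
    While the max-min gap exceeds the threshold, the least loaded node
    takes over part of the most loaded node's load."""
    while True:
        hi = max(loads, key=loads.get)
        lo = min(loads, key=loads.get)
        gap = loads[hi] - loads[lo]
        if gap <= threshold:
            return loads
        share = gap // 2          # lo takes over half of the surplus
        loads[hi] -= share
        loads[lo] += share

print(rebalance({"n1": 900, "n2": 100, "n3": 200}, threshold=50))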
As the benchmark for comparison, MapReduce-Based Density Clustering (MBDC) [11, 30] is selected. To facilitate the experiments, the MBDC algorithm has been adapted as follows: (1) divide the vector space of data points into partitions and distribute the data accordingly; (2) in the Map phase, execute local clustering for the data points in each partition; (3) in the Shuffle phase, send the data of neighboring partitions to the same node; (4) in the Reduce phase, merge the clusters into separate partitions and store them in HDFS.

Time efficiency
This subsection tests the time efficiency of DDCM. The data size ranges from 3 K to 3 M, and the number of nodes is fixed at 32. As shown in Fig. 9, the advantage of both DDCM-A and DDCM grows with the data volume, and DDCM-A consumes less time than DDCM.

Scalability
We validate the speedup ratio and expanding ratio of DDCM in this subsection. The speedup ratio is the ratio of the time a single node consumes to the time the distributed execution consumes, and the expanding ratio is the speedup ratio divided by the number of nodes. The amount of data is fixed at 3 M, and the number of nodes varies between 8 and 64. Both DDCM-A and DDCM show stable speedup ratios (Fig. 10) and good expanding ratios (Fig. 11) even when the number of nodes is large.
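Written out, with T_1 the time a single node consumes and T_M the time M nodes consume:

\mathrm{speedup}(M) = \frac{T_1}{T_M}, \qquad \mathrm{expanding\ ratio}(M) = \frac{\mathrm{speedup}(M)}{M} = \frac{T_1}{M \cdot T_M}

A stable expanding ratio of about 0.6, as reported in the abstract, therefore corresponds to T_1 ≈ 0.6·M·T_M.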

Result gathering performance
This subsection tests whether DDCM can gather maximum or minimum values efficiently. The gathering method, described in Sect. 5.2, is referred to as the DDCM method. An intuitive method, the Aggregating Method (AM), serves as the benchmark: all extreme values are sent to the integration node, where they are sorted to find the maxima or minima. Our experiments use 32 nodes, and the number of gathered maximum or minimum values ranges from 1 to 1000. As can be seen in Fig. 12, DDCM offers significant advantages over AM.

Conclusions
Based on a decentralized architecture, DDCM performs distributed clustering tasks. Clusters can be generated by several adjacent nodes without setting up a master node or broadcasting global information. The proposed method takes full advantage of large-scale distributed computation while requiring minimal network resources. Experiments have demonstrated that it eliminates performance bottlenecks and provides good scalability. Meanwhile, since each node has the same function and logic, the difficulty of distributed deployment and management is reduced. DDCM is therefore well suited to the clustering demands of large-scale data sets, and in the context of the developing digital society, our method has potential applications and significant benefits. Future work will focus on automatically setting the number of nodes according to the data volume, which will further reduce the cost of clustering.

Fig. 1 Distribution of clustering objects on nodes in the two-dimensional CAN protocol

Fig. 2 Interior point and edge point on the local node

Fig. 4 The surrounding area with the width of ε

Fig. 5 Two cases of merging for cross-node clusters after local clustering


Fig. 7 Maximum or minimum value gathering based on the competition tree
Fig. 8 Randomly generated clustering objects

Fig. 9 Time efficiency in different amounts of data