A layered parallel algorithm for data clustering via fuzzy c-means technique

In this study, a layered parallel algorithm via the fuzzy c-means (FCM) technique, called LP-FCM, is proposed in the Map-Reduce framework for data clustering problems. LP-FCM mainly contains three layers. The first layer follows a parallel data partitioning method that randomly divides the original dataset into several subdatasets. The second layer uses a parallel cluster-center searching method based on Map-Reduce, in which the classic FCM algorithm searches the cluster centers of each subdataset in the Map phase, and all centers are gathered in the Reduce phase, where the FCM technique is applied again to confirm the final cluster centers. The third layer implements a parallel data clustering method based on the final cluster centers. The feasibility of LP-FCM in terms of clustering accuracy is evaluated by comparing it with several classic randomly initialized sequential clustering algorithms, including K-means, K-medoids, FCM and MinMax K-means, on 20 benchmark datasets. Furthermore, the clustering time and the parallel performance are tested on several generated large-scale datasets.

In the FCM technique, each data point can belong to more than one cluster with different membership values ranging over [0, 1]. Additionally, the sum of the membership values for each data point must be equal to one [2,38].

Let $X = \{x_i\}_{i=1}^{N}$ be a collection of instances in an $n$-dimensional vector space, and let $c$ ($2 \le c \le N$) denote the number of clusters. An optimal $c$-partition is obtained iteratively by minimizing the weighted sum-of-squared-error objective function

$$J_w(U, C) = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{w}\, d^2(x_i, c_j),$$

where $u_{ij} = u(x_i, c_j) \in [0, 1]$ is the degree of membership of $x_i$ in the $j$-th cluster, $w$ ($1 < w < \infty$) is a fuzzification coefficient on each fuzzy membership, $c_j$ is the center of the $j$-th cluster, and $d^2(x_i, c_j)$ is a squared distance measure between the instance $x_i$ and the center $c_j$. The detailed iterative process is given in Algorithm 1. At iteration $k$, the cluster centers are calculated with the membership matrix $U^k$:

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{w}\, x_i}{\sum_{i=1}^{N} u_{ij}^{w}},$$

and the membership matrix $U^{k+1}$ is then updated using

$$u_{ij} = \left[\, \sum_{l=1}^{c} \left( \frac{d(x_i, c_j)}{d(x_i, c_l)} \right)^{\frac{2}{w-1}} \right]^{-1},$$

where $d(x_i, c_j) = \lVert x_i - c_j \rVert_2$.
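To make the Algorithm 1 iteration concrete, the following is a minimal NumPy sketch of the classic FCM procedure described above; the function name fcm and the stopping parameters max_iter and tol are illustrative choices rather than the paper's notation.

```python
import numpy as np

def fcm(X, c, w=2.0, max_iter=100, tol=1e-5, seed=0):
    """Classic fuzzy c-means (Algorithm 1): alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Random initial membership matrix U, each row normalized to sum to one.
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Uw = U ** w
        # Cluster centers: membership-weighted means of the samples.
        centers = (Uw.T @ X) / Uw.sum(axis=0)[:, None]
        # Squared Euclidean distances between every sample and every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                      # avoid division by zero
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(w-1)).
        inv = d2 ** (-1.0 / (w - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:            # stop when U barely changes
            U = U_new
            break
        U = U_new
    return centers, U
```

The alternating center/membership updates stop when the largest change in the membership matrix falls below the tolerance.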

The framework of Map-Reduce
In general, the Mapper function processes the input data and produces some intermediate key/value pairs; the pairs sharing the same key are then grouped together and passed to the Reducer function, which aggregates them into the final outputs.

The overall workflow of the proposed LP-FCM is illustrated in Fig. 2. In what follows, we will elaborate the layers one by one.

Fig. 2. The workflow of LP-FCM. (1) In Layer 1, the first Map-Reduce job shuffles each sub-dataset so that the randomized data (large light red cylinder) are formed; the second job divides the randomized data into several sub-datasets (medium light red cylinders), splits each sub-dataset into several small data sets (small blue cylinders), and combines the small data sets with the same key into one sub-dataset (medium yellow cylinder). (2) Layer 2 applies FCM to each sub-dataset (medium yellow cylinder) to confirm its centers (small purple/red/green circles); the centers are merged into a new dataset to which FCM is applied again to determine the final centers. (3) Layer 3 labels each sample with the nearest center by calculating the distance between the sample and each center; finally, the samples (small purple/red/green cylinders) belonging to the same category are gathered into a cluster (medium purple/red/green cylinder).

The parallel data partitioning method

In this layer, we develop a parallel data partitioning method to randomly divide a dataset into several subdatasets by applying the Map-Reduce technique. As shown in Fig. 2, the developed method mainly contains two Map-Reduce jobs. The first Map-Reduce job is a parallel data randomizing method, which is used to randomize the data samples, and the second Map-Reduce job is a parallel data partitioning method that divides the randomized dataset into several subdatasets.

Firstly, in order to randomize a large-scale dataset in a parallel processing way, suppose the dataset $X$ can be partitioned into $m$ subsets $X_1, X_2, \ldots, X_m$; the $j$-th RandomizedDataMapper then processes the subset $X_j$ so that its samples are emitted in a randomized order. The second Map-Reduce job in this layer then divides the randomized dataset into several subdatasets; there are also two parameters that need to be determined for this job.
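As a rough single-machine illustration of Layer 1, the sketch below simulates the two Map-Reduce jobs with plain Python functions; the use of random keys for shuffling and the helper names randomize_job and partition_job are assumptions made for illustration, since the original RandomizedDataMapper listing is not reproduced here.

```python
import numpy as np

def randomize_job(X, rng):
    """First job (sketch): each 'mapper' emits (random key, sample); sorting by key shuffles the data."""
    keyed = [(rng.random(), x) for x in X]      # map phase: attach a random key to every sample
    keyed.sort(key=lambda kv: kv[0])            # shuffle/sort phase orders the samples by key
    return np.array([x for _, x in keyed])      # reduce phase: collect the randomized data

def partition_job(X_rand, m):
    """Second job (sketch): split the randomized data into m sub-datasets of roughly equal size."""
    return [X_rand[j::m] for j in range(m)]     # sub-dataset j gathers the samples assigned key j

# Usage: randomize a toy dataset and divide it into m = 4 sub-datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
subsets = partition_job(randomize_job(X, rng), m=4)
print([s.shape for s in subsets])
```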
The parallel cluster centers searching method

In this layer, suppose the randomized dataset $X$ has been partitioned into $m$ subsets $X_1, X_2, \ldots, X_m$. In the $j$-th SearchCentersMapper, the classic FCM method (Algorithm 1) is applied to the subset $X_j$ to confirm its cluster centers. In the SearchCentersReducer, all the centers produced by the Mappers are gathered into a new dataset, and the FCM method is applied to this dataset again to determine the final cluster centers.

The parallel data clustering method

In this layer, the randomized dataset is again treated as $m$ subsets $X_1, X_2, \ldots, X_m$. In the $j$-th ClusterDataMapper phase, the Euclidean distance between each sample in $X_j$ and each final center is firstly calculated. Then, the index of the center nearest to a sample is selected as the key, while the sample itself is treated as the value. The outputs with the same key are finally gathered into one cluster.
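The sketch below plays through Layers 2 and 3 in the same single-machine style, assuming the fcm() routine from the Algorithm 1 sketch above is in scope; the function names search_centers_job and cluster_data_job are hypothetical and only mirror the Mapper/Reducer roles described in the text.

```python
import numpy as np
# Assumes the fcm(X, c, ...) routine from the Algorithm 1 sketch above is in scope.

def search_centers_job(subsets, c, w=2.0):
    """Layer 2 (sketch): run FCM on each sub-dataset (map phase), then run FCM again
    on the pooled centers (reduce phase) to confirm the final cluster centers."""
    local_centers = [fcm(Xj, c, w)[0] for Xj in subsets]   # one SearchCentersMapper per sub-dataset
    pooled = np.vstack(local_centers)                      # gather all local centers
    final_centers, _ = fcm(pooled, c, w)                   # SearchCentersReducer: FCM on the centers
    return final_centers

def cluster_data_job(subsets, centers):
    """Layer 3 (sketch): label every sample with the index of its nearest final center."""
    labels = []
    for Xj in subsets:                                     # one ClusterDataMapper per sub-dataset
        d2 = ((Xj[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels.append(d2.argmin(axis=1))                   # key = index of the nearest center
    return labels

# Usage with the sub-datasets produced by the Layer 1 sketch:
# final_centers = search_centers_job(subsets, c=3)
# labels = cluster_data_job(subsets, final_centers)
```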

The proposed LP-FCM clustering algorithm
On the basis of the previous work, the layered parallel clustering algorithm via the FCM technique is assembled from the three layers described above.

From the values recorded in Table 2, the clustering accuracy of these algorithms on the 20 benchmark datasets can be compared intuitively. In Table 3, the six generated large-scale datasets are described in detail; they are used to test the execution time and the parallel performance of the proposed algorithm.

The parallel performance analysis
In addition, we also compute some evaluation indices such as Speedup, Scaleup and Sizeup. The Scaleup performance is summarized in Fig. 6, and the Sizeup performance of the proposed LP-FCM algorithm is studied further below.
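For reference, the conventional definitions of these indices, as commonly used in parallel clustering studies, are given below (the paper's own formulas are not reproduced here, so these should be read as the standard forms):

$$\mathrm{Speedup}(m) = \frac{T_1}{T_m}, \qquad \mathrm{Scaleup}(D, m) = \frac{T(D, 1)}{T(mD, m)}, \qquad \mathrm{Sizeup}(D, m) = \frac{T(mD)}{T(D)},$$

where $T_m$ denotes the running time on $m$ processors, $T(D, m)$ denotes the running time on dataset $D$ with $m$ processors, and, for Sizeup, the number of processors is held fixed while the dataset is enlarged $m$ times.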

During this experiment, the number of processors is fixed at 5 and 6, respectively, while the datasets used are 1 to 6 times the size of the original Covertype data (25 MB). The detailed tendencies are summarized in Fig. 7. From the curves, the proposed approach shows a good Sizeup performance in the given parallel system.

Conclusion
In this study, a parallel algorithm named LP-FCM is proposed for data clustering problems.