A penalized criterion for selecting the number of clusters for K-medians

Clustering is a standard unsupervised machine learning technique for grouping data points into groups based on similar features. We focus here on unsupervised clustering for contaminated data, i.e. on the case where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to consider the choice of the optimal number of clusters as the minimization of a risk function via penalization. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared with other popular techniques in a simulation study. All studied algorithms are available in the R package Kmedians on CRAN.


Introduction
Clustering is an unsupervised machine learning technique that consists in grouping data points into a collection of groups based on similar features. Clustering is commonly used for data compression in image processing, which is also known as vector quantization (Gersho and Gray, 2012). There is a vast literature on clustering techniques, and general references can be found in Spath (1980); Jain and Dubes (1988); Mirkin (1996); Jain et al. (1999); Berkhin (2006); Kaufman and Rousseeuw (2009). Classification methods can be categorized as hard clustering, also referred to as crisp clustering (including K-means, K-medians and hierarchical clustering), and soft clustering (such as Fuzzy K-means (Dunn, 1973; Bezdek, 2013) and mixture models). In hard clustering methods, each data point belongs to only one group, whereas in soft clustering a probability or likelihood of belonging to a cluster is assigned to each data point, allowing it to be a member of more than one group. We focus here on hard clustering methods.

The most popular partitioning clustering methods are the non-sequential (Forgy, 1965) and the sequential (MacQueen, 1967) versions of the K-means algorithm. The aim of the K-means algorithm is to minimize the sum of squared distances between the data points and their respective cluster centroid. More precisely, considering random vectors $X_1, \ldots, X_n$ taking values in $\mathbb{R}^d$, the aim is to find $k$ centroids $\{c_1, \ldots, c_k\}$ minimizing the empirical distortion $\frac{1}{n}\sum_{i=1}^{n}\min_{1\le j\le k}\|X_i - c_j\|^2$. Nevertheless, in many real-world applications, the collected data may be contaminated by outliers of large magnitude, which makes traditional clustering methods such as K-means sensitive to their presence.
As a result, it is necessary to use more robust clustering algorithms that produce reliable outcomes. One such algorithm is K-medians clustering, which was introduced by MacQueen (1967) and further developed by Kaufman and Rousseeuw (2009): instead of using the mean to determine the centroid of each cluster, it relies on the geometric median, which makes it much less sensitive to outliers.

The paper is organized as follows. In Section 2, we provide a recap of two different methods for estimating the geometric median, followed by the introduction of three K-medians algorithms ("Online", "Semi-Online" and "Offline"). In Section 3, we propose a penalty shape for the penalized criterion and give an upper bound on the expected distortion of the empirically optimal codebook with the selected number of clusters, which justifies this penalty. We illustrate the proposed approach with simulations and compare it with several other methods in Section 4. Finally, the proofs are gathered in Section 5. All the proposed algorithms are available in the R package Kmedians on CRAN: https://cran.r-project.org/package=Kmedians.

Geometric Median
In what follows, we consider a random variable X taking values in R^d for some d ≥ 1. It is well known that the standard mean of X is not robust to corruption, and the median is therefore preferred to the mean in robust statistics. The geometric median m, also called L^1-median or spatial median, of a random variable X ∈ R^d is defined (Haldane, 1948) by $m \in \arg\min_{h \in \mathbb{R}^d} \mathbb{E}\big[\|X - h\| - \|X\|\big]$. In dimension 1, the geometric median coincides with the usual median in R. Since the Euclidean space R^d is strictly convex, the geometric median m exists and is unique as soon as the distribution of X is not concentrated on a single straight line (Kemperman, 1987). The geometric median is known to be robust and has a breakdown point of 0.5.
Let us now consider a sequence X_1, ..., X_n of i.i.d. copies of X. In this paper, we focus on two methods to estimate the geometric median. The first one is iterative and consists in considering the fixed-point iterates (Weiszfeld, 1937; Vardi and Zhang, 2000)
$$\hat m_{t+1} = \left( \sum_{i \in \mathcal{X}_t} \frac{X_i}{\|X_i - \hat m_t\|} \right) \Big/ \left( \sum_{i \in \mathcal{X}_t} \frac{1}{\|X_i - \hat m_t\|} \right),$$
with an initial point $\hat m_0 \in \mathbb{R}^d$ chosen arbitrarily such that it does not coincide with any of the X_i, and $\mathcal{X}_t = \{i : X_i \neq \hat m_t\}$. This Weiszfeld algorithm is a flexible technique, but it raises implementation difficulties for massive data in high-dimensional spaces.

An alternative and simple estimation algorithm can be seen as a stochastic gradient algorithm (Robbins and Monro, 1951; Ruppert, 1985; Duflo, 1997; Cardot et al., 2013) and is defined by
$$m_{j+1} = m_j + \gamma_{j+1} \frac{X_{j+1} - m_j}{\|X_{j+1} - m_j\|},$$
where m_0 is an arbitrarily chosen starting point and γ_j is a step size such that γ_j > 0 for all j ≥ 1, $\sum_{j \ge 1} \gamma_j = \infty$ and $\sum_{j \ge 1} \gamma_j^2 < \infty$. Its averaged version (ASG), which is effective for large samples of high-dimensional data, was introduced by Polyak and Juditsky (1992) and adapted by Cardot et al. (2013); it is defined by
$$\bar m_{j+1} = \bar m_j + \frac{1}{j+1}\left( m_{j+1} - \bar m_j \right).$$
One can speak about averaging since $\bar m_j = \frac{1}{j}\sum_{i=1}^{j} m_i$. We note that, under suitable assumptions, both $\hat m_t$ and $\bar m_t$ are asymptotically efficient (Vardi and Zhang, 2000; Cardot et al., 2013).
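To fix ideas, here is a minimal R sketch of both estimators (a from-scratch illustration, not the implementation of the Kmedians package). The function names weiszfeld_median and asg_median, the starting points and the step size γ_j = c_γ / j^α with α ∈ (1/2, 1] are illustrative choices.

```r
# Weiszfeld fixed-point algorithm for the geometric median
weiszfeld_median <- function(X, max_iter = 200, tol = 1e-8) {
  m <- colMeans(X)                          # starting point (coordinate-wise mean)
  for (t in seq_len(max_iter)) {
    d <- sqrt(rowSums(sweep(X, 2, m)^2))    # distances ||X_i - m_t||
    keep <- d > tol                         # X_t = {i : X_i != m_t}
    w <- 1 / d[keep]
    m_new <- colSums(X[keep, , drop = FALSE] * w) / sum(w)
    if (sqrt(sum((m_new - m)^2)) < tol) return(m_new)
    m <- m_new
  }
  m
}

# Averaged stochastic gradient (ASG) estimate of the geometric median
asg_median <- function(X, c_gamma = 1, alpha = 0.66) {
  n <- nrow(X)
  m <- X[1, ]              # current iterate m_j
  m_bar <- m               # averaged iterate
  for (j in seq_len(n)[-1]) {
    g <- X[j, ] - m
    nrm <- sqrt(sum(g^2))
    if (nrm > 0) m <- m + (c_gamma / j^alpha) * g / nrm   # Robbins-Monro step
    m_bar <- m_bar + (m - m_bar) / j                      # running average
  }
  m_bar
}
```

For instance, with `X <- matrix(rnorm(500 * 5), 500, 5)`, both `weiszfeld_median(X)` and `asg_median(X)` return vectors close to the origin, the geometric median of this sampling distribution.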

K-medians
For a positive integer k, a vector quantizer Q of dimension d and codebook size k is a (measurable) mapping of the d-dimensional Euclidean space R^d into a finite set of points {c_1, ..., c_k} (Linder, 2000). More precisely, the points c_i ∈ R^d, i = 1, ..., k, are called the codepoints, and the vector composed of the codepoints {c_1, ..., c_k} is called the codebook, denoted by c. Given a d-dimensional random vector X admitting a finite first-order moment, the L^1-distortion of a vector quantizer Q with codebook c = {c_1, ..., c_k} is defined by
$$W(c) = \mathbb{E}\left[ \min_{1 \le j \le k} \|X - c_j\| \right]. \qquad (2)$$
Let us now consider i.i.d. random vectors X_1, ..., X_n in R^d with the same law as X. One can then define the empirical L^1-distortion as
$$W_n(c) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \|X_i - c_j\|. \qquad (3)$$
In this paper, we consider two types of K-medians algorithms: a sequential and a non-sequential algorithm.
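For later use, the empirical L^1-distortion (3) can be computed directly; the short R helper below (an illustrative sketch, not a function of the Kmedians package) is reused in the algorithm sketches of this section.

```r
# Empirical L1-distortion W_n(c): mean distance of each point to its closest codepoint
# X: n x d data matrix, centers: k x d matrix of codepoints
l1_distortion <- function(X, centers) {
  k <- nrow(centers)
  D <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]  # points x centers
  mean(apply(D, 1, min))
}
```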
The non-sequential algorithm uses a Lloyd-style iteration that alternates between an expectation (E) step and a maximization (M) step; it is described in Algorithm 1 (Non-Sequential K-medians Algorithm).
For 1 ≤ j ≤ k, m_j is nothing but the geometric median of the points in the cluster C_j. As m_j is not explicit, we use either the Weiszfeld algorithm (version indicated by "Offline") or ASG (version indicated by "Semi-Online") to estimate it. The Online K-medians algorithm proposed by Cardot et al. (2012), based on an averaged Robbins–Monro procedure (Robbins and Monro, 1951; Polyak and Juditsky, 1992), is described in Algorithm 2.

The non-sequential algorithms are effective, but their computational time is large compared to the sequential ("Online") algorithm, which is very fast and only requires O(knd) operations, where n is the sample size, k the number of clusters and d the dimension. Furthermore, for large samples, the Online algorithm is expected to estimate the cluster centers as well as the non-sequential algorithms (Cardot et al., 2012). Hence, for large sample sizes the Online algorithm should be preferred, whereas the non-sequential versions remain attractive for smaller samples.
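The following R sketch illustrates both families of algorithms, using the weiszfeld_median, asg_median and l1_distortion helpers defined above (replacing weiszfeld_median by asg_median in the M-step gives the "Semi-Online" variant). It is a simplified illustration of the Lloyd-style and averaged Robbins–Monro schemes, not the implementation of the Kmedians package; in particular, the initialization below is random, whereas the experiments of Section 4 use a robust hierarchical clustering initialization.

```r
# "Offline" K-medians: Lloyd-style alternation between assignment and median update
kmedians_offline <- function(X, k, max_iter = 50) {
  n <- nrow(X)
  centers <- X[sample(n, k), , drop = FALSE]
  for (it in seq_len(max_iter)) {
    # E-step: assign each point to its closest codepoint
    D <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
    cl <- max.col(-D)
    # M-step: each codepoint becomes the geometric median of its cluster
    new_centers <- centers
    for (j in seq_len(k)) {
      pts <- X[cl == j, , drop = FALSE]
      if (nrow(pts) > 1) new_centers[j, ] <- weiszfeld_median(pts)
      else if (nrow(pts) == 1) new_centers[j, ] <- pts[1, ]
    }
    if (max(abs(new_centers - centers)) < 1e-6) { centers <- new_centers; break }
    centers <- new_centers
  }
  list(centers = centers, cluster = cl, distortion = l1_distortion(X, centers))
}

# "Online" K-medians: one averaged Robbins-Monro step per observation
kmedians_online <- function(X, k, c_gamma = 1, alpha = 0.66) {
  n <- nrow(X)
  centers <- X[sample(n, k), , drop = FALSE]   # current iterates
  centers_bar <- centers                       # averaged iterates
  counts <- rep(1, k)
  for (t in seq_len(n)) {
    d <- sqrt(rowSums(sweep(centers, 2, X[t, ])^2))
    j <- which.min(d)                          # closest current center
    if (d[j] > 0) {
      counts[j] <- counts[j] + 1
      step <- c_gamma / counts[j]^alpha
      centers[j, ] <- centers[j, ] + step * (X[t, ] - centers[j, ]) / d[j]
      centers_bar[j, ] <- centers_bar[j, ] + (centers[j, ] - centers_bar[j, ]) / counts[j]
    }
  }
  list(centers = centers_bar, distortion = l1_distortion(X, centers_bar))
}
```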

The choice of k
In this section, we adapt the results obtained for K-means in Fischer (2011) to K-medians clustering. To this aim, let X_1, ..., X_n be i.i.d. random vectors with the same law as X, and assume that ∥X∥ ≤ R almost surely for some R > 0. Let S_k denote the countable set of all codebooks {c_1, ..., c_k} ∈ Q^k, where Q is some grid over R^d. It is important to note that Q represents the search space for the centers.
Since ∥X∥ is assumed to be bounded by R, we consider a grid Q ⊂ B(0, R), where B(0, R) denotes the closed ball centered at 0 with radius R. A codebook ĉ_k ∈ S_k is said to be empirically optimal if it minimizes the criterion W_n(c) over S_k, i.e. ĉ_k ∈ arg min_{c ∈ S_k} W_n(c). Our aim is to determine k̂ minimizing a criterion of the type
$$\mathrm{crit}(k) = W_n(\hat c_k) + \mathrm{pen}(k),$$
where pen : {1, ..., n} → R_+ is a penalty function described later. The purpose of this penalization is to prevent the choice of too large a value of k by introducing a penalty into the objective function.
In this section, we give an upper bound on the expected distortion of the empirically optimal codebook with the selected number of clusters. It relies on the following general non-asymptotic upper bound.

Theorem 3.1. Let X_1, ..., X_n be independent random vectors taking values in R^d with the same law as X, and assume that ∥X∥ ≤ R almost surely for some R > 0. Define W and W_n as in (2) and (3), respectively. Then, for all 1 ≤ k ≤ n, the expected maximal deviation between W and its empirical counterpart W_n over S_k is bounded by an explicit quantity of order n^{-1/2}.

This theorem shows that the maximal difference between the distortion and the expected empirical distortion of any vector quantizer is of order n^{-1/2}. Selecting the search space for the centers is crucial, since a larger search space results in a larger upper bound.
Theorem 3.2. Let X be a random vector taking values in R^d and assume that ∥X∥ ≤ R almost surely for some R > 0. Consider nonnegative weights {x_k}_{1≤k≤n} such that $\sum_{k=1}^{n} e^{-x_k} = \Sigma$. Define W as in (2) and suppose that, for all 1 ≤ k ≤ n, the penalty pen(k) is large enough (in a sense depending on the weights x_k). Then the expected distortion at ĉ = ĉ_k̂, where k̂ is the minimizer of the penalized criterion, satisfies an oracle-type inequality whose remainder term involves Σ.
Note that the weights {x_k}_{1≤k≤n} appear in the penalty function, while Σ, which depends on these weights, appears in the upper bound on the expected distortion at ĉ. The larger the weights {x_k}_{1≤k≤n}, the smaller the value of Σ, so a compromise between these two terms is required. Consider the simple situation where one can take x_k = Lk for some positive constant L: then $\Sigma = \sum_{k=1}^{n} e^{-Lk}$ is bounded independently of n, and we deduce that the penalty shape is a known function pen_shape(k), multiplied by a constant a.
where W is defined in (2).
Assume that, for every 1 ≤ k ≤ n, the penalty is taken of the form pen(k) = a · pen_shape(k), where a is a positive constant satisfying a ≥ 48 so that the hypothesis of Theorem 3.2 is fulfilled. Using Theorem 3.2 and Proposition 3.1, we obtain an oracle-type inequality; minimizing the term on the right-hand side of this inequality gives the optimal order for k, and we conclude that our penalty is of the shape pen_shape(k), known up to the multiplicative constant a.

In Birgé and Massart (2007), a data-driven method was introduced to calibrate such criteria whose penalties are known up to a multiplicative factor: the "slope heuristics". This method consists in estimating the constant of the penalty function by the slope of the expected linear relation between −W_n(ĉ_k) and the penalty shape values pen_shape(k).

Estimation of the constant a. Denote by c* = arg min_{c ∈ S} W(c) and c_k = arg min_{c ∈ S_k} W(c) the optimal codebook over the whole search space S and the best codebook within the model S_k, respectively. It was shown in Birgé and Massart (2007); Arlot and Massart (2009); Baudry et al. (2012) that, under suitable conditions and for large n, the optimal penalty is such that −W_n(ĉ_k) behaves like a linear function of pen_shape(k) for large k, with slope a_opt/2. The slope Ŝ of the linear regression of −W_n(ĉ_k) on pen_shape(k) is therefore computed to estimate a_opt/2, and we finally take pen(k) := a_opt · pen_shape(k) = 2Ŝ · pen_shape(k).
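A minimal R sketch of the resulting selection procedure is given below. It is an illustration, not the capushe or Kmedians implementation: it assumes that the empirical distortions W_n(ĉ_k) have already been computed for k = 1, ..., k_max, that pen_shape contains the corresponding penalty-shape values, and the number of points used in the regression (the largest values of k) is a tuning choice.

```r
# Slope heuristic: estimate the constant a in pen(k) = a * pen_shape(k)
# Wn:        vector of empirical distortions W_n(c_k) for k = 1, ..., k_max
# pen_shape: vector of penalty-shape values for the same k's
select_k_slope <- function(Wn, pen_shape, n_last = 20) {
  k_max <- length(Wn)
  idx <- (k_max - n_last + 1):k_max              # fit on large k, where -W_n is ~ affine
  fit <- lm(y ~ x, data = data.frame(x = pen_shape[idx], y = -Wn[idx]))
  S_hat <- unname(coef(fit)[2])                  # estimates a_opt / 2
  crit <- Wn + 2 * S_hat * pen_shape             # penalized criterion crit(k)
  list(k = which.min(crit), a = 2 * S_hat, crit = crit)
}
```

For instance, with the kmedians_offline sketch of Section 2, `Wn <- sapply(1:20, function(k) kmedians_offline(X, k)$distortion)` followed by `select_k_slope(Wn, pen_shape)` returns the selected number of clusters together with the calibrated constant.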
Of course, since this method is based on asymptotic results, it can encounter some practical problems when the dimension d is larger than the sample size n.

Simulations
The whole method is implemented in R, and all the studied algorithms are available in the R package Kmedians: https://cran.r-project.org/package=Kmedians. In what follows, the initial centers are generated with a robust hierarchical clustering algorithm from the genieclust package (Gagolewski et al., 2016).

Visualization of results with the package Kmedians
In Section 3, we showed that the penalty is of the form a · pen_shape(k), where pen_shape is a known function and a is a constant to calibrate. To find the constant a, we use the data-driven calibration algorithm for penalization procedures explained at the end of Section 3. This data-driven slope estimation method is implemented in CAPUSHE (CAlibrating Penalty Using Slope HEuristics) (Brault et al., 2011), which is available in the R package capushe: https://cran.r-project.org/package=capushe. The slope estimation method is designed to be robust to possible undesirable variations of the criterion. More precisely, for a given number of clusters k, the algorithm may be trapped in a local minimum, which can create a "bad point" for the slope heuristic; the slope heuristic has therefore been designed to be robust to the presence of such points.
In what follows, we consider a random variable X following a Gaussian mixture model with k = 6 classes, whose centers are generated according to U_10, the uniform law on the sphere of radius 10. We consider n = 3000 i.i.d. realizations of X in dimension d = 5. We first focus on some visualizations of our slope method. To estimate a ≈ 2Ŝ in the penalty function, it suffices to estimate Ŝ, which is the slope of the red curve in Figure 1. As shown in Figure 1, the regression slope is estimated using the last 21 points, since −W_n(ĉ_k) behaves like an affine function of the penalty shape when k is large. In Figure 2 (left), two possible elbows are observed in the curve; consequently, the elbow method suggests considering either 5 or 6 as the number of clusters.
One would rather choose 5, since the elbow at 5 is more pronounced than the one at 6; therefore, the elbow method is not ideal in this case. Figures 3 to 5 represent the data as curves, which we call "profiles" (the x-axis corresponds to the coordinates and the y-axis to the values of the coordinates), gathered by cluster and with the centers of the groups represented in red. We also show the first two principal components of the data obtained with a robust principal component analysis (RPCA) (Cardot and Godichon-Baggioni, 2017). In Figure 3, we focus on the clustering obtained with the K-medians algorithm ("Offline" version) on non-contaminated data. In each cluster, the curves are close to each other and close to the median, and the profiles differ from one cluster to another, meaning that our method separates the 6 groups well. In order to assess the robustness of the proposed method, we then considered data contaminated with the law Z = (Z_1, ..., Z_5), where the Z_i are i.i.d. with Z_i ∼ T_1, T_1 being the Student distribution with one degree of freedom.
Applying our method for selecting the number of clusters for the K-medians algorithms, the correct number of clusters is selected. Furthermore, the obtained groups, despite the presence of some outliers in each cluster, are coherent. In the case of K-means clustering, however, the method yields non-homogeneous clusters, i.e. it treats some far outliers as singleton clusters (see Figure 5).
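As an illustration, an experiment of this type can be reproduced end to end by combining the sketches of the previous sections. The code below makes simplifying assumptions (class centers drawn uniformly on the sphere of radius 10, identity covariance within each class, and a penalty shape taken as k/n purely for illustration); it is not the exact simulation design used in this section.

```r
set.seed(1)
n <- 3000; d <- 5; k_true <- 6

# Draw the 6 class centers uniformly on the sphere of radius 10 (assumption for this sketch)
centers_true <- t(replicate(k_true, { u <- rnorm(d); 10 * u / sqrt(sum(u^2)) }))

# Sample the mixture, assuming identity covariance within each class
labels <- sample(k_true, n, replace = TRUE)
X <- centers_true[labels, ] + matrix(rnorm(n * d), n, d)

# Optional contamination: replace a fraction of the rows by heavy-tailed Student noise
rho <- 0.05
out <- sample(n, floor(rho * n))
X[out, ] <- matrix(rt(length(out) * d, df = 1), length(out), d)

# Distortion curve and slope-heuristic selection (pen_shape = k/n chosen for illustration)
k_max <- 20
Wn <- sapply(1:k_max, function(k) kmedians_offline(X, k)$distortion)
res <- select_k_slope(Wn, pen_shape = (1:k_max) / n, n_last = 10)
res$k   # selected number of clusters
```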

Comparison with Gap Statistic and Silhouette
In what follows, we focus on the choice of the number of clusters and compare our results with different methods. For this, we generated some basic data sets under three different scenarios (S1), (S2) and (S3) (see Fischer (2011)). We applied three methods for determining the number of clusters: the proposed slope method, the Gap Statistic and the Silhouette method. For each method, we use four clustering algorithms: K-medians ("Online", "Semi-Online", "Offline") and K-means. For each scenario, we contaminated the data with the law Z = (Z_1, ..., Z_d), where the Z_i are i.i.d. with Z_i ∼ T_1, the Student distribution with one degree of freedom. Then, we evaluate the different methods and scenarios by considering:

• N: the number of times the correct number of clusters is obtained over 50 repeated trials without contaminated data.
• k̄: the average number of clusters obtained over 50 trials without contaminated data.
• N_0.1: the number of times the correct number of clusters is obtained over 50 repeated trials with 10% of contaminated data.
• k̄_0.1: the average number of clusters obtained over 50 trials with 10% of contaminated data.
In the case of well-separated clusters, as in scenario (S2), the Gap Statistic and the Silhouette method give competitive results. Nevertheless, for closer clusters, as in scenario (S1), the slope method works much better than the Gap Statistic and the Silhouette method. The Gap Statistic only works in scenario (S2) and is ineffective in the presence of contamination; in the scenarios with closer clusters, it often selects 1 as the number of clusters. The Silhouette method performs moderately well in scenario (S2) and very well in scenario (S3), but it is globally not as competitive as the slope method, especially on contaminated data. In scenarios (S1) and (S2) with the slope method, Offline, Semi-Online, Online and K-means give good results, but in the presence of contamination K-means breaks down completely while the three K-medians algorithms remain fairly insensitive. Furthermore, on non-Gaussian data (scenario (S3)), the K-means method does not work at all; in such cases, K-medians clustering is preferred over K-means.
Overall, in every scenario, the Offline, Semi-Online and Online K-medians algorithms combined with the slope method give very competitive results, and when the data are contaminated they clearly outperform the other methods, the Offline version performing best.

Contaminated Data in Higher Dimensions
We now focus on the impact of contaminated data on the selection of the number of clusters in K-medians clustering, particularly in higher dimensions. We compare our method with the Silhouette method and SigClust (Liu et al., 2008; Huang et al., 2015; Bello et al., 2023) in the Offline setting, since it yields competitive results, as noted in the previous section. SigClust is a method for testing whether a sample comes from a single Gaussian distribution or several, in high dimension. Starting from k = k_0, we test, for all possible pairs of clusters, whether the fusion of the two clusters comes from a single Gaussian or not. If the test rejects the single-Gaussian hypothesis for every pair, the same procedure is repeated for k + 1. If there is a pair for which the test is not rejected, the two corresponding clusters are considered as one and the procedure stops; the optimal number of clusters is then determined as k_opt = k − 1. Note that we did not include the Gap Statistic in this comparison, as it is computationally expensive, especially in high dimensions.
To this aim, we generate data using a Gaussian mixture model with 10 classes in dimensions 100 and 200, where the class centers are randomly generated on a sphere of radius 10 and each class contains 100 data points. The data are contaminated with the law Z = (Z_1, ..., Z_d), where d is the dimension (100 or 200 depending on the scenario) and the Z_i are i.i.d., with two possible scenarios: Z_i ∼ T_1 or Z_i ∼ T_2. Here, T_m denotes the Student distribution with m degrees of freedom. In what follows, we denote by ρ the proportion of contaminated data. In order to compare the different clusterings, we focus on the Adjusted Rand Index (ARI) (Rand, 1971; Hubert and Arabie, 1985), a measure of similarity between two clusterings based on the number of correctly classified pairs. For each scenario, we report the average number of clusters obtained over 50 trials and the average ARI evaluated only on the uncontaminated data. We observe that without contamination, all methods give similar results. In the presence of contamination, however, our method consistently performs well, while the others struggle to identify an appropriate number of clusters. With a Student contamination with one degree of freedom, our method excels in terms of both the number of clusters and the ARI; with two degrees of freedom, the results are comparable to those obtained with the Silhouette method.
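In R, this evaluation can be written as follows, assuming that the true labels, the indices of the contaminated observations (out) and a clustering result fit, as in the sketch of Section 4.1, are available; adjustedRandIndex comes from the mclust package, but any ARI implementation could be used instead.

```r
library(mclust)   # provides adjustedRandIndex()

fit <- kmedians_offline(X, res$k)                # clustering with the selected number of clusters
clean <- setdiff(seq_len(nrow(X)), out)          # indices of the uncontaminated observations
ari <- adjustedRandIndex(labels[clean], fit$cluster[clean])
```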
In summary, our method demonstrates remarkable robustness in the face of contaminated data, making it a strong choice for clustering in higher dimensions. The comparison with the Silhouette method and SigClust in the Offline setting confirms the effectiveness of our approach, especially when computational efficiency is a critical factor in high-dimensional data.

An illustration on real data
We first briefly describe the data used for clustering, which were provided by Califrais, a company specializing in environmentally responsible technology to optimize logistics flows on a large scale. Our goal is to build a recommender system designed to suggest items individually for each user based on historical data or preferences. In this setting, clustering algorithms can be employed to identify groups of similar customers, where each group consists of customers sharing similar features or properties. Performing robust clustering is crucial for developing an effective recommender system.

The dataset includes information on 508 customers through nine features representing the total number of products purchased in each of the following categories: Fruits, Vegetables, Dairy products, Seafood, Butcher, Deli, Catering, Grocery, and Accessories and equipment. We therefore have a sample size of n = 508 and a dimension of d = 9. To apply clustering, we determine the appropriate number of clusters using the proposed method. Before applying our method, we normalize the data using RobustScaler, which removes the median and scales the data according to the interquartile range, i.e. the range between the 1st and the 3rd quartiles.

The profiles of the clusters obtained with our slope method, the Silhouette method and the Gap Statistic are plotted in Figures 6, 7 and 8. Our method selects 5 clusters, while the Gap Statistic suggests 3 clusters and the Silhouette method 2 clusters. Regarding the Silhouette method, the second cluster obtained is not homogeneous, as can be seen in Figure 7. The Gap Statistic, which yields 3 clusters, separates this second cluster into two (Clusters 2 and 3); however, homogeneity is still not achieved in its third cluster (Figure 8). In Figure 6, it can be seen that the clusters generated by our slope method are more homogeneous. To make a connection with the simulations of Section 4.2, recall that in scenario (S1) the Silhouette method and the Gap Statistic failed to find the correct number of clusters when the clusters are close; this is reflected here, since the behavior of the clients does not vary drastically, resulting in close clusters.

To give an overview of our clusters, the first cluster represents customers who regularly purchase products from all categories, while the third cluster consists of customers who mainly order Catering products. Clusters 2, 4 and 5 correspond to customers who purchase significant amounts of Butcher, Deli and Catering products, at different levels, as depicted in Figure 6.
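For completeness, the robust normalization described above (median centering and scaling by the interquartile range) can be written in a few lines of R; this is a sketch of the preprocessing step, with X standing for the 508 × 9 matrix of purchase counts, and is not necessarily the implementation used for the original analysis.

```r
# Robust scaling: remove the median and divide by the interquartile range, column by column
# (columns with a zero IQR would need special care)
med <- apply(X, 2, median)
iqr <- apply(X, 2, IQR)
X_scaled <- sweep(sweep(X, 2, med, "-"), 2, iqr, "/")
```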

Conclusion
The proposed penalized criterion, calibrated with the help of the slope heuristic, consistently gives competitive results for selecting the number of clusters in K-medians, even in the presence of outliers, outperforming other methods such as the Gap Statistic, the Silhouette method and SigClust. Notably, our method performs well even in high dimension. The performances of the three K-medians algorithms, Offline, Semi-Online and Online, are generally comparable, with Offline being slightly better; however, for large sample sizes, one may prefer the Online K-medians algorithm for its computation time. As discussed in Section 2, it is thus recommended to use the Offline algorithm for small sample sizes, the Semi-Online algorithm for intermediate sample sizes, and the Online algorithm for large sample sizes. On the real-data illustration, the proposed method produces more homogeneous clusters and a more suitable number of clusters than the other methods. In conclusion, this paper presents a robust and efficient approach for selecting the number of clusters in K-medians, with good performance even in challenging scenarios, and provides practical recommendations for algorithm selection based on the sample size.

Some definitions and lemmas
First, we provide some definitions and lemmas that are useful to prove Theorems 3.1 and 3.2.

Definitions :
• Let (S, p) be a totally bounded metric space.For any F ⊂ S and ϵ > 0 the ϵ-covering number N p (F , ϵ) of F is defined as the minimum number of closed balls with radius ϵ whose union covers F .
• A family {T_s : s ∈ S} of zero-mean random variables indexed by the metric space (S, p) is called subgaussian in the metric p if, for any λ > 0 and s, s′ ∈ S, we have $\mathbb{E}\big[e^{\lambda (T_s - T_{s'})}\big] \le e^{\lambda^2 p(s, s')^2 / 2}$.
• The family {T_s : s ∈ S} is called sample continuous if, for any sequence s_1, s_2, ... ∈ S such that s_j → s ∈ S, we have T_{s_j} → T_s with probability one.
Lemma 5.1 (Hoeffding). Let Y_1, ..., Y_n be independent zero-mean random variables such that a ≤ Y_i ≤ b for i = 1, ..., n. Then, for all λ > 0, $\mathbb{E}\big[e^{\lambda \sum_{i=1}^{n} Y_i}\big] \le e^{n \lambda^2 (b-a)^2 / 8}$.

Lemma 5.3 (Bartlett et al. (1998), Lemma 1). Let S(0, R) denote the closed d-dimensional sphere of radius R centered at 0. Let ϵ > 0 and let N(ϵ) denote the cardinality of the minimum ϵ-covering of S(0, R), that is, N(ϵ) is the smallest integer N such that there exist points {y_1, ..., y_N} ⊂ S(0, R) with $\min_{1 \le j \le N} \|y - y_j\| \le \epsilon$ for every y ∈ S(0, R). Then, for all ϵ ≤ 2R, N(ϵ) admits an explicit bound in terms of R, ϵ and d. Since there are $\binom{N}{k}$ ways to choose k codepoints from a set of N points {y_1, ..., y_N}, this implies a corresponding bound on the number of codebooks built from the covering. Moreover, for any codepoints {c_1, ..., c_k} contained in S(0, R), there exists a set of codepoints {c'_1, ..., c'_k} taken from the covering whose distortion is close to that of {c_1, ..., c_k}: considering q ∈ arg min_{j=1,...,k} ∥x − c_j∥, one bounds min_{j=1,...,k} ∥x − c'_j∥ in terms of min_{j=1,...,k} ∥x − c_j∥ and ϵ, and in the same way, considering q′ ∈ arg min_{j=1,...,k} ∥x − c'_j∥, one bounds min_{j=1,...,k} ∥x − c_j∥ in terms of min_{j=1,...,k} ∥x − c'_j∥ and ϵ.

Proof of Theorem 3.1
The proof of Theorem 3.1 is inspired by the proof of Theorem 3 in Linder (2000).
Let us first demonstrate that the family of random variables {T_c : c ∈ S_k} is subgaussian and sample continuous in a suitable metric, where the Y_i are independent, have zero mean and are bounded. By Lemma 5.1, we obtain the corresponding subgaussian bound. As the family {T_c} is subgaussian and sample continuous, applying Jensen's inequality to the concave function f yields the result, where we used that $\int \ln x \, dx = x \ln x - x$ and ln 4 ≤ 3.

Proof of Theorem 3.2
Theorem 3.2 is an adaptation of Theorem 8.1 in Massart (2007) and Theorem 2.1 in Fischer (2011).
Proof. By definition of ĉ, for all k, 1 ≤ k ≤ n, and all c_k ∈ S_k, we have:

Consider nonnegative weights {x_l}_{1≤l≤n} such that $\sum_{l=1}^{n} e^{-x_l} = \Sigma$ and let z > 0. Applying Lemma 5.5 with $f(x) = \frac{1}{n} \min_{1 \le j \le l} \|x - c_j\|$, a = 0 and $b = \frac{2R}{n}$, for all l, 1 ≤ l ≤ n, and all ϵ_l > 0,

Figure 2: Evolution of W_n(ĉ_k) (on the left) and crit(k) (on the right) with respect to k.
Note that, in the case of contaminated data (Figures 4 and 5), only 95% of the data are represented for better visualization. In Figure 5, Clusters 5, 7, 8, 11 and 12 are therefore not visible since they consist of "far" outliers.

Figure 3: Profiles (on the left) and clustering via K-medians represented on the first two principal components (on the right), without contaminated data.

Figure 4: Profiles (on the left) and clustering via the K-medians algorithm represented on the first two principal components (on the right), with 5% of contaminated data.

Figure 5: Profiles (on the left) and clustering via the K-means algorithm represented on the first two principal components (on the right), with 5% of contaminated data.

Figure 6: Califrais data: profiles (on the left) and clustering with the slope method represented on the first two principal components (on the right).

Figure 7: Califrais data: profiles (on the left) and clustering with the Silhouette method represented on the first two principal components (on the right).

Figure 8: Califrais data: profiles (on the left) and clustering with the Gap Statistic method represented on the first two principal components (on the right).
Lemma (Cesa-Bianchi and Lugosi (1999), Proposition 3). If {T_s : s ∈ S} is subgaussian and sample continuous in the metric p, then the expectation of sup_{s ∈ S} T_s is bounded by an entropy integral involving the covering numbers N_p(S, ϵ).
Lemma 5.5 (McDiarmid et al. (1989); Massart (2007), Theorem 5.3). If X_1, ..., X_n are independent random variables and F is a finite or countable class of real-valued functions such that a ≤ f ≤ b for all f ∈ F, then the supremum over F of the centered empirical sums concentrates around its expectation.

The family {T_c : c ∈ S_k} is subgaussian and sample continuous in a suitable metric. For any c, c′ ∈ S_k, define $p(c, c') = \sup_{\|x\| \le R} \big| \min_{1 \le j \le k} \|x - c_j\| - \min_{1 \le j \le k} \|x - c'_j\| \big|$ and $p_n(c, c') = \sqrt{n}\, p(c, c')$; then p_n is a metric on S_k.

Table 1: Comparison of the number of times the correct number of clusters is obtained and of the average selected number of clusters, for the different methods, without contaminated data and with 10% of contaminated data.

Table 2: Comparison of the selected number of clusters and the average ARI obtained with the different methods, with respect to the proportion of contaminated data.