Entropic principal component analysis using Cauchy–Schwarz divergence

Modern pattern recognition applications frequently involve high-dimensional datasets. Over the last decades, different approaches have been proposed to address the curse of dimensionality present in this kind of data. Principal component analysis is a classic method that is still widely used for this purpose. However, its procedure is based on the covariance matrix, which is built from point-wise scalar products of feature vectors, making it very sensitive to noise and outliers. This work presents a patch-based mapping to an entropic space. Given a data sample neighborhood, each feature value set is mapped to a univariate Gaussian distribution described by its parameters. Then, each scalar coordinate of a data sample is replaced by the parameter tuple that describes the corresponding feature. The difference between two data sample vectors in the entropic space is defined as the vector whose scalar coordinates are given by the stochastic divergence between two probability distributions. The covariance matrix is still defined by the scalar product between the difference vector of a data sample and the average sample, so it can be used transparently in the original PCA algorithm. This patch mapping to the entropic space aims to mitigate the effect of noise and outliers. Experiments adopting the Cauchy–Schwarz divergence show that the new framework can outperform several existing dimensionality reduction algorithms in cluster analysis tasks on multiple real datasets.


Introduction
High-dimensional data are present in several domains of science. A large number of features and samples is common in modern pattern recognition and machine learning applications. While a large number of samples is good for those tasks, the increase in the feature size can bring negative consequences [3,4,7,10,12,22,27]. The curse of dimensionality phenomenon states that, as the number m of features grows, more samples are needed to approximate the function governing the data [5,8,18]. Thus, a large sample size n is required to extract relevant information from high-dimensional data. However, in real contexts, n is limited or even scarce in relation to m. Therefore, a natural way to mitigate this problem is to reduce the data dimensionality m.
The main purpose of pattern recognition is to develop models for the automatic discovery of regularities in data through computational algorithms. One of the key steps in this process is the definition of a suitable distance measure between samples [11,24]. Being able to properly quantify the similarity between observations is necessary for any kind of data analysis. In this context, unsupervised metric learning aims at finding a suitable distance function for a given dataset.
Dimensionality reduction (DR) algorithms are mathematical tools for data analysis and metric learning. The intuition behind these procedures is that, usually, the observed data samples lie along a low-dimensional structure embedded in a high-dimensional input space. The low-dimensional space reflects some unknown underlying parameters (i.e., local coordinates) that are encoded in the original feature space. Attempting to uncover this hidden structure in a dataset is the major goal of DR.
It has been shown that DR has a strong relation to metric learning because it derives a distance metric that better quantifies dissimilarity between samples [2,11,20,24,25]. So, besides helping data visualization and alleviating the computational burden, DR also handles the curse of dimensionality by learning an adaptive, data-dependent similarity measure that leads to a more compact data representation.
Among all DR methods, principal component analysis (PCA) [9] is still one of the main algorithms used by researchers, mainly due to its tradition, simplicity and low computational complexity. PCA is based on finding the orthogonal directions that maximize the data variance, which is optimal from a data representation point of view, since it is equivalent to minimizing the mean square error between the original and the reduced representations. For this reason, after the PCA transformation, the data is usually organized in clusters with large scattering, which is undesirable for classification. Another problem with PCA is that its strategy is based on the covariance matrix, built from point-wise scalar products of feature vectors, which makes it very sensitive to noise and outliers [14].
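For reference, the computation just described (centering, covariance via point-wise scalar products, and eigendecomposition) can be sketched in a few lines of NumPy; the function below is a generic illustration, not code from the original work.

```python
import numpy as np

def pca(X, d):
    """Classical PCA: project the n x m data matrix X onto the d leading
    principal directions (eigenvectors of the sample covariance matrix)."""
    mu = X.mean(axis=0)                            # dataset mean
    Xc = X - mu                                    # centered data
    C = (Xc.T @ Xc) / X.shape[0]                   # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # d leading eigenvectors
    return Xc @ W                                  # n x d reduced representation
```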
This work presents a patch-based mapping from the usual feature space to an entropic feature space. Given a data sample neighborhood, each feature value set is mapped to a univariate Gaussian distribution described by its parameters. Then, each scalar coordinate of a data sample is replaced by the parameter tuple that describes the corresponding feature. The difference between two data sample vectors in the entropic space is defined as the vector whose scalar coordinates are given by the stochastic divergence between two probability distributions. The covariance matrix is still defined by the scalar product between the difference vector of a data sample and the average sample, so it can be used transparently in the original PCA algorithm. Therefore, the effect of noise and outliers is expected to be attenuated by the patch mapping to the entropic space.
Overall, experiments adopting the Cauchy–Schwarz divergence show, for several real datasets, that the clusters obtained by the new framework exhibit a lower intra-class scattering in comparison to other DR methods. The remainder of the paper is organized as follows: Sect. 2 provides the necessary theoretical foundation. Section 3 describes the experiments and exhibits the results. Section 4 discusses the obtained conclusions. Finally, Sect. 5 presents some possibilities for future work.

PCA using Cauchy-Schwarz divergence
A dataset is defined as the set X = {x_1, x_2, ..., x_n}, where x_i ∈ R^m. If, for all i, x_i is linked with its k nearest neighbors, the KNN graph is defined as G = (V, E), where |V| = n. The Euclidean distance can be used for this linkage, assuming that a neighborhood is itself approximately a Euclidean subspace [16]. A patch P_i is defined as {x_i} ∪ {x_j ∈ N(i)}, with N(i) being the neighborhood of x_i; P_i also denotes the m × (k + 1) matrix that represents the i-th patch. We assume that each row of the matrix P_i is a sample of size k + 1 of a univariate random variable x, characterized by a probability density function p(x; θ), where θ ∈ R^L is a vector of L parameters. In this study, we consider a Gaussian model, that is, L = 2, θ_1 = μ denotes the mean and θ_2 = σ^2 denotes the variance. Each random variable corresponds to one of the m input features. Each P_i is mapped to an m-dimensional vector of 2D tuples, where the j-th tuple, for j = 1, ..., m, holds the maximum likelihood estimates of the parameters of the j-th feature. In other words, we compute the sample mean and variance of each row of the matrix P_i. The entropic feature vector p_i for the patch P_i is given by:

$$\vec{p}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{im})$$

where each component is a tuple of two parameters:

$$\theta_{ij} = (\mu_{ij}, \sigma^2_{ij}), \quad j = 1, \ldots, m$$

Figure 1 illustrates the idea of this mapping from a patch P_i to a parametric feature vector p_i. The set of all p_i, for i = 1, 2, ..., n, defines the entropic feature space. We can associate to the entropic feature space a centroid, which represents the average distribution:

$$\tilde{p} = (\tilde{\theta}_1, \ldots, \tilde{\theta}_m), \qquad \tilde{\theta}_j = \left(\frac{1}{n}\sum_{i=1}^{n}\mu_{ij},\ \frac{1}{n}\sum_{i=1}^{n}\sigma^2_{ij}\right)$$

Let the entropic difference between two vectors p_i and p_j in the entropic feature space be the vector of Cauchy–Schwarz divergences between each pair of corresponding tuples:

$$d_{CS}(\vec{p}_i, \vec{p}_j) = \big(D_{CS}(p(x;\theta_{i1}), p(x;\theta_{j1})), \ldots, D_{CS}(p(x;\theta_{im}), p(x;\theta_{jm}))\big)$$

where D_CS(p, q) is the Cauchy–Schwarz divergence [6] between probability density functions p and q:

$$D_{CS}(p, q) = -\log \frac{\int p(x)\, q(x)\, dx}{\sqrt{\int p^2(x)\, dx \int q^2(x)\, dx}}$$

The Cauchy–Schwarz divergence is the counterpart of the relative entropy (i.e., the Kullback–Leibler divergence) for the quadratic entropy. In this study, we assume a univariate Gaussian model for each feature, that is, p(x|θ_i) and q(x|θ_j) are N(μ_1, σ_1^2) and N(μ_2, σ_2^2), respectively, which leads to [19]:

$$D_{CS}(p, q) = \frac{1}{2}\log\left(\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1\sigma_2}\right) + \frac{(\mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)}$$

Finally, using this closed-form expression in the entropic difference, we define the entropic covariance matrix C as:

$$C = \frac{1}{n}\sum_{i=1}^{n} d_{CS}(\vec{p}_i, \tilde{p})\, d_{CS}(\vec{p}_i, \tilde{p})^T$$

where d_CS(p_i, p̃) is an m-dimensional vector of Cauchy–Schwarz divergences between the local distributions estimated from each patch and the average distribution:

$$d_{CS}(\vec{p}_i, \tilde{p}) = \big(D_{CS}(p(x;\theta_{i1}), p(x;\tilde{\theta}_1)), \ldots, D_{CS}(p(x;\theta_{im}), p(x;\tilde{\theta}_m))\big)$$

The entropic covariance matrix can then be used transparently in the original PCA algorithm: the final PCA projection matrix (responsible for the linear projection from the old to the new coordinates) is built, as usual, with the eigenvectors of C.
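To make the pipeline concrete, the sketch below gives one possible Python/NumPy implementation of the steps described above: KNN patches, per-feature Gaussian parameter estimation, Cauchy–Schwarz divergences to the average distribution, and eigendecomposition of the entropic covariance matrix. The use of scikit-learn's NearestNeighbors, the small variance regularization constant, and the centering of the final projection are our illustrative choices, not prescriptions from the original formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cs_divergence(mu1, var1, mu2, var2):
    """Closed-form Cauchy-Schwarz divergence between univariate Gaussians."""
    return (0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
            + (mu1 - mu2) ** 2 / (2.0 * (var1 + var2)))

def cspca(X, d, k):
    """Entropic PCA sketch: patches -> Gaussian parameters -> CS divergences
    to the average distribution -> entropic covariance -> projection."""
    n, m = X.shape
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: the point itself
    _, idx = knn.kneighbors(X)                          # n x (k+1) patch indices
    mu, var = np.empty((n, m)), np.empty((n, m))
    for i in range(n):
        patch = X[idx[i]]                               # (k+1) x m patch P_i
        mu[i] = patch.mean(axis=0)                      # per-feature sample mean
        var[i] = patch.var(axis=0) + 1e-8               # ML variance (regularized)
    mu_bar, var_bar = mu.mean(axis=0), var.mean(axis=0) # average distribution
    D = cs_divergence(mu, var, mu_bar, var_bar)         # n x m divergence vectors
    C = (D.T @ D) / n                                   # entropic covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]       # m x d projection matrix
    return (X - X.mean(axis=0)) @ W                     # usual linear PCA projection
```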

Traditional covariance matrix behavior under data perturbation
In the traditional feature space, a vector x_i ∈ R^m is written as

$$\vec{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T$$

The dataset is defined by the set of x_i for i = 1, ..., n. The dataset mean is

$$\vec{\mu} = \frac{1}{n}\sum_{i=1}^{n}\vec{x}_i$$

and the dataset covariance matrix is defined as

$$C = \frac{1}{n}\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu})(\vec{x}_i - \vec{\mu})^T$$

In the presence of additive Gaussian noise with zero mean, each x_i will be mapped to

$$\hat{\vec{x}}_i = \vec{x}_i + \vec{\eta}_i, \qquad \vec{\eta}_i \sim N(\vec{0}, \sigma^2 I)$$

So the covariance matrix of the noisy dataset will be given by

$$\hat{C} = \frac{1}{n}\sum_{i=1}^{n}(\vec{x}_i + \vec{\eta}_i - \hat{\vec{\mu}})(\vec{x}_i + \vec{\eta}_i - \hat{\vec{\mu}})^T$$

Notice that, if x_i is an outlier, the term (x_i − μ) can explode. That is, outliers can cause significant damage to the covariance matrix under data perturbation.
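A small synthetic illustration of this sensitivity (the data, the outlier magnitude and the sample sizes are arbitrary assumptions chosen only to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # well-behaved data, m = 5
C_clean = np.cov(X, rowvar=False)             # approximately the identity

X_out = X.copy()
X_out[0] = 50.0                               # a single gross outlier
C_out = np.cov(X_out, rowvar=False)

# The outer product (x_i - mu)(x_i - mu)^T of the outlier dominates the sum,
# inflating the estimate by orders of magnitude:
print(np.linalg.norm(C_clean), np.linalg.norm(C_out))
```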

Entropic covariance matrix behavior under data perturbation
Instead of using x_i directly, the stochastic distances are functions of the differences between the patch means. When the data are corrupted by noise, we have

$$\hat{\vec{x}}_i = \vec{x}_i + \vec{\eta}_i, \qquad E[\vec{\eta}_i] = \vec{0}$$

As $E[\hat{\vec{x}}_i] = E[\vec{x}_i] + \vec{0} = E[\vec{x}_i]$, the local means of the patches are not affected by the noise:

$$\hat{\mu}_{ij} = \mu_{ij}, \quad j = 1, \ldots, m$$

So the global mean will be the same:

$$\hat{\tilde{p}} = \tilde{p}$$

The entropic covariance matrix is a function of the local means, which are capable of filtering the noise:

$$\hat{C} = \frac{1}{n}\sum_{i=1}^{n} d_{CS}(\hat{\vec{p}}_i, \hat{\tilde{p}})\, d_{CS}(\hat{\vec{p}}_i, \hat{\tilde{p}})^T = C$$

So the noisy entropic covariance matrix equals the entropic covariance matrix; that is, the presence of noise practically does not affect it. Even in the presence of outliers, the damage will not be amplified as strongly as in the traditional feature space. Additionally, the Cauchy–Schwarz divergence is defined in terms of a logarithmic function, so it grows very slowly, with a saturating behavior. This implies that the entropic covariance matrix is even less sensitive to outliers. As d_CS is upper bounded, the eigenvalues of C are limited and tend not to explode, which is numerically more stable and may avoid ill-conditioning of the matrix.
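The noise-filtering role of the local means can be checked with a quick synthetic experiment (the patch size, noise level and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
patch = rng.normal(loc=3.0, scale=1.0, size=(21, 10))     # one patch, k+1 = 21, m = 10
noise = rng.normal(loc=0.0, scale=0.5, size=patch.shape)  # zero-mean additive noise

mu_clean = patch.mean(axis=0)            # per-feature patch means
mu_noisy = (patch + noise).mean(axis=0)  # means after perturbation

# Averaging over the patch filters the zero-mean noise: the parameter estimates
# (and hence the CS divergences built from them) change far less than the
# individual coordinates, which are perturbed by the full noise magnitude.
print(np.abs(mu_noisy - mu_clean).max())
```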

Influence of outliers in clustering assessment
As illustrated by Fig. 2, the presence of outliers raises the intra-cluster scattering (R) of both groups 1 and 2 and reduces the inter-cluster separation (D). Notice that this idea generalizes to any number of groups. Therefore, the clustering quality of the representation generated by a DR method can be used to evaluate whether the influence of outliers has been diminished. Additionally, there is a natural relation between a good clustering structure and a better classification accuracy, given that more compact and well-separated clusters favor a lower classification error.
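This effect can be reproduced with two synthetic groups and a single injected outlier, taking the intra-cluster scattering R as the mean distance to the cluster centroid and the inter-cluster separation D as the distance between centroids (these particular definitions, and all the numbers, are illustrative assumptions rather than the paper's exact measures):

```python
import numpy as np

rng = np.random.default_rng(2)
g1 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))   # group 1
g2 = rng.normal(loc=6.0, scale=1.0, size=(100, 2))   # group 2
g1_out = np.vstack([g1, [[12.0, 12.0]]])             # group 1 plus one outlier

def scatter(g):
    """Intra-cluster scattering R: mean distance to the cluster centroid."""
    return np.linalg.norm(g - g.mean(axis=0), axis=1).mean()

def separation(a, b):
    """Inter-cluster separation D: distance between the cluster centroids."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

print(scatter(g1), separation(g1, g2))          # without the outlier
print(scatter(g1_out), separation(g1_out, g2))  # R increases, D decreases
```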
The goal is to assess the quality of the clusters obtained after feature extraction. A cluster is a set of samples that belongs to the same class in the original dataset, given that all samples were labeled in all datasets (i.e., every sample belongs to some class). We used the Silhouette Coefficient (SC) [15] to measure the similarity between a given data sample and its own cluster (cohesion) in comparison to the other clusters (separation). This measure provides a quantitative way to analyze the consistency within clusters. The idea is to measure, for all clusters, how tight each cluster is. A high SC indicates a low intra-class variability (i.e., a low scatter dispersion). The results can be found in Table 2, where the column CSPCA denotes the proposed entropic method under the Gaussian hypothesis. The best result in each row is boldfaced. It is possible to see that in almost half of the cases (seven), CSPCA was the most adequate method under cluster assessment via the Silhouette coefficient. It is therefore the method that was superior in the largest number of cases (the second one is UMAP, which was superior in just three cases). We can also notice that CSPCA has the highest average Silhouette value.
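For reference, the Silhouette Coefficient of the class-induced clusters in a reduced representation can be computed directly with scikit-learn; the helper below is our own wrapper, not code from the experiments.

```python
from sklearn.metrics import silhouette_score

def cluster_quality(Y, labels):
    """Mean Silhouette Coefficient of the class-induced clusters in the
    reduced space Y (n x d), using the known class labels as clusters."""
    return silhouette_score(Y, labels)
```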
Regarding parameter tuning, there are two relevant parameters for the analysis: the target dimensionality d and the KNN neighborhood size K. With respect to d, for all DR methods, we exhaustively tested all values of d, computed the silhouette coefficient, and selected the d that maximizes this metric. With respect to K, besides CSPCA, all the adopted manifold learning algorithms are also influenced by this parameter. For its estimation, we adopted a strategy analogous to the one used for d: an exhaustive search guided by performance, where the set of possible values of K was defined by an initial value and an increment window based on the number of samples n. The intuition behind this choice is that a smaller K is usually preferred in small datasets to preserve patch locality, but, for suitable parameter estimation, the trade-off between a sufficiently large sample size and locality preservation must be considered. Table 3 reports the running times for the mfeat-fourier dataset, which has a large number of samples (2000) and features (76). We considered only CSPCA and the manifold learning algorithms, as they are the methods that utilize K. Regarding resource allocation during execution, CSPCA consumes only 20% of our hardware capacity, while any of the manifold learning algorithms needs at least 50%, with the worst case being t-SNE, which takes almost 100%.
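A sketch of the exhaustive, performance-guided search over the target dimensionality d described above; the same loop can be nested over the candidate values of K. The function and the candidate sets are illustrative assumptions, not the exact procedure used in the experiments.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def tune_dimension(X, labels, dr_method, d_values):
    """Run the DR method for every candidate d and keep the d that
    maximizes the Silhouette Coefficient of the class-induced clusters."""
    best_d, best_sc = None, -np.inf
    for d in d_values:
        Y = dr_method(X, d)              # e.g. lambda X, d: cspca(X, d, k=20)
        sc = silhouette_score(Y, labels)
        if sc > best_sc:
            best_d, best_sc = d, sc
    return best_d, best_sc
```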

Conclusions
This work presents a patch-based entropic unsupervised metric learning framework that employs the Cauchy–Schwarz divergence and is applied to the PCA method. We show theoretically that the framework is capable of diminishing the influence of noise and outliers in PCA and how this can be measured via cluster assessment. Then, we show experimentally, using the Silhouette cluster validation index, that CSPCA is a competitive candidate among the existing unsupervised feature extraction options, being superior to individual competing methods on several datasets and superior to all compared algorithms in some cases.
It is important to note that, in the metric learning field, many techniques exist precisely because each one can be more adequate to a certain scenario than the others. As this adequacy is not clear beforehand, conclusions can only be obtained experimentally. Our experiments corroborate this idea, as they show that there is often a particular case in which some method is superior to all others, and therefore there is no single method that is superior for all the datasets in our experiments.

Future work
- In order to enrich the comparison, metrics other than the Silhouette can be added to the analysis, such as Gamma, Dunn, Root Mean Square Standard Deviation, R-Squared, Calinski-Harabasz and Davies-Bouldin. In this case, a multi-comparison hypothesis test should be used, for example via the Friedman test followed by the post-hoc Nemenyi test.
- Regarding the graph construction step, a supervised KNN approach (which considers only neighbors that belong to the same class as the central data sample) can be employed. Another possibility is the use of the ε-ball strategy in place of the KNN one. In all cases, metrics other than the Euclidean can be adopted (e.g. Jaccard, Minkowski, Cosine). All these variations in the graph construction could possibly improve the performance of the proposed method.
- Another possible improvement to the framework is to extend it to other statistical models. If the dataset has multi-modal features, Gaussian Mixture Models and Kernel Density Estimation can be used. The generalization to other probability densities is straightforward, as the Cauchy-Schwarz divergence can be computed for other distributions.
- To conclude, an important point in our view is that this same entropic approach proposed for PCA could also be incorporated into other methods, such as Linear Discriminant Analysis and Isometric Feature Mapping.
Author Contributions AL (supervisor) was responsible for conceptualization, computational implementation and final revision. EN (Ph.D. student) was responsible for writing the paper, computational implementations and final revision.

Fig. 2 Influence of outliers in the clustering quality

Table 2 Silhouette coefficients