A kernel Principal Component Analysis (kPCA) digest with a new backward mapping (pre-image reconstruction) strategy

Methodologies for dimensionality reduction aim at discovering the low-dimensional manifolds where data lie. Principal Component Analysis (PCA) is very effective if the data have a linear structure, but it fails to identify a possible dimensionality reduction when the data belong to a nonlinear low-dimensional manifold. For nonlinear dimensionality reduction, kernel Principal Component Analysis (kPCA) is appreciated because of its simplicity and ease of implementation. This paper provides a concise review of the main ideas behind PCA and kPCA, collecting in a single document aspects that are often dispersed. Moreover, a strategy to map the reduced representation back into the original high-dimensional space is devised, based on the minimization of a discrepancy functional.


Introduction
The kernel Principal Component Analysis (kPCA) is a very effective and popular technique for nonlinear dimensionality reduction Schölkopf et al. (1998b). It is applied in a large variety of fields: image processing and signal denoising Rathi et al. (2006); Twining and Taylor (2001); Mika et al. (1999), face recognition Wang (2012), credible real-time simulations in engineering applications González et al. (2018); Lopez et al. (2018); Díez et al. (2018), and health and living matter sciences Shiokawa et al. (2018); Widjaja et al. (2012); Grassi et al. (2014). However, kPCA is often presented by giving only the general ideas and avoiding the fundamental mathematical details. These details are soundly described only in a few selected papers, in particular when it comes to the forward mapping (or dimensionality reduction, see Izenman (2008)) and to the backward mapping or pre-image reconstruction, see Snyder et al. (2013).
This paper aims at providing a kPCA digest: a condensed description with all the details necessary to understand and implement the method, alternative and complementary to the existing literature. Besides, motivated by the different alternatives available in the literature to achieve the backward mapping (or pre-image reconstruction) Kwok and Tsang (2004); Mika et al. (1999); Wang (2012); Schölkopf et al. (1998a); Rathi et al. (2006); Zheng et al. (2010), a new approach is proposed based on the minimization of a discrepancy functional (residual). This novel idea allows dealing with the very frequently used Gaussian kernel in a straightforward and efficient manner.
2 Principal Component Analysis (PCA)

2.1 Diagonalizing the covariance matrix

Vectors $x_1, x_2, \ldots, x_{n_s}$ in $\mathbb{R}^d$ are seen as $n_s$ samples of some (possibly random) variable $x \in \mathbb{R}^d$. It is assumed that $n_s \gg d$, such that the sample is representative of the variability of $x$. The samples are collected in a $d \times n_s$ matrix $X = [x_1\; x_2\; \cdots\; x_{n_s}]$.
PCA aims at discovering a linear model of dimension $k$, with $k \ll d$, properly representing the variability of $x$, thus eliminating the intrinsic redundancy in the $d$ dimensions of vector $x$.
The covariance $d \times d$ matrix is defined as

$$C = X X^T. \qquad (1)$$

In fact, the symmetric and positive semidefinite matrix $C$ stands for the covariance matrix of $x$ if the mean value of $x$ is zero. This entails no loss of generality because it simply requires subtracting the mean value of $x$ from all the columns of $X$. Diagonalizing $C$ results in finding a diagonal matrix $\Lambda$ of eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ and an orthogonal matrix $U$ such that

$$C = U \Lambda U^T. \qquad (2)$$

Matrix $U$ describes an orthonormal basis of $\mathbb{R}^d$, the basis formed by its columns $U = [u_1\; u_2\; \cdots\; u_d]$.
Vector $z = U^T x$ is the expression of $x$ in this new basis (note that $x = U z$). Thus, matrix $Z = U^T X$ is the corresponding matrix of samples of $z$. The covariance of $z$ is $Z Z^T = \Lambda$. Since the covariance of $z$ is diagonal, the $d$ components of $z$ are uncorrelated random variables.
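The diagonalization step above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not code from the paper; the synthetic data and the names `X`, `C`, `lam`, `U` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples: d = 3 dimensions, n_s = 200 samples, built so that
# the three directions carry very different variances.
d, ns = 3, 200
X = rng.normal(size=(d, ns)) * np.array([[3.0], [1.0], [0.05]])
X = X - X.mean(axis=1, keepdims=True)      # center the columns (zero mean)

C = X @ X.T                                # covariance matrix C = X X^T
lam, U = np.linalg.eigh(C)                 # eigh returns ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]             # reorder: lambda_1 >= ... >= lambda_d

Z = U.T @ X                                # samples expressed in the eigenbasis
# The covariance of z is diagonal: Z Z^T = Lambda
assert np.allclose(Z @ Z.T, np.diag(lam))
```

Note that `numpy.linalg.eigh` exploits the symmetry of $C$ and returns eigenvalues in ascending order, hence the reordering.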

2.2 Reducing the dimension
The trace of $C$, which equals the trace of $\Lambda$ and therefore the sum of the $\lambda_i$, for $i = 1, \ldots, d$, is the total variance of the vector sample. If the eigenvalues decrease quickly, a reduced number of dimensions $k$ collects a significant amount of the variance. This is the case if, for some small tolerance $\varepsilon$, $k$ is such that

$$\sum_{i=1}^{k} \lambda_i \ge (1 - \varepsilon) \sum_{i=1}^{d} \lambda_i. \qquad (3)$$

Thus, the eigenvalues from $k+1$ to $d$ are neglected and, consequently, the last $d-k$ columns of matrix $U$ (or the last rows of matrix $U^T$) are not expected to contribute to describe the variability of $x$. Accordingly, the last $d-k$ components of vector $z$ are suppressed without significant loss of information.
Let $U^\star = [u_1\; u_2\; \cdots\; u_k]$ denote the $d \times k$ matrix containing the first $k$ columns of $U$. The forward mapping then reads

$$z^\star = {U^\star}^T x \qquad (4)$$

in the reduced-dimension space. Thus, the samples in $X$ map into $Z^\star = {U^\star}^T X$ of reduced dimension $k \times n_s$. The backward mapping (from the reduced-dimension space $\mathbb{R}^k$ to the full dimension $\mathbb{R}^d$) can be seen as a truncation of the relation $x = U z$, which is rewritten as

$$x \approx x^\star = U^\star z^\star = \sum_{i=1}^{k} [z]_i\, u_i, \qquad (5)$$

being $[z]_i$ the $i$-th component of vector $z$. In matrix form, when restricted to the samples, this reads

$$X \approx U^\star Z^\star. \qquad (6)$$

The reduced-order model is interpreted as taking $x$ in the $k$-dimensional linear manifold (or subspace) generated by the basis $\{u_1, u_2, \ldots, u_k\}$. The error introduced by the reduction of dimensionality (from $d$ to $k$) is associated with the discrepancy $X - U^\star Z^\star$ and decreases as the tolerance $\varepsilon$ decreases (and $k$ increases).
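The truncation criterion and the forward/backward mappings can be sketched as follows. This is our own minimal NumPy example; the synthetic data and the names `eps`, `k`, `Ustar` are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, ns = 5, 300
# Data essentially living in a 2D subspace of IR^5, plus small noise.
A = rng.normal(size=(d, 2))
X = A @ rng.normal(size=(2, ns)) + 1e-3 * rng.normal(size=(d, ns))
X = X - X.mean(axis=1, keepdims=True)

C = X @ X.T
lam, U = np.linalg.eigh(C)
lam, U = lam[::-1], U[:, ::-1]

# Smallest k such that sum_{i<=k} lambda_i >= (1 - eps) * total variance.
eps = 1e-4
cum = np.cumsum(lam) / np.sum(lam)
k = int(np.argmax(cum >= 1.0 - eps)) + 1

Ustar = U[:, :k]              # d x k truncated basis
Zstar = Ustar.T @ X           # forward mapping, k x ns
Xrec = Ustar @ Zstar          # backward mapping (truncated reconstruction)

rel_err = np.linalg.norm(X - Xrec) / np.linalg.norm(X)
```

For this data the criterion detects the intrinsic dimension ($k = 2$) and the reconstruction error is of the order of the added noise.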

2.3 Singular Value Decomposition (SVD): an alternative to diagonalization
The SVD provides a factorization of the $d \times n_s$ matrix $X$ of the form

$$X = U \Sigma V^T, \qquad (7)$$

being $U$ and $V$ orthogonal matrices of sizes $d \times d$ and $n_s \times n_s$, respectively, and $\Sigma$ a diagonal $d \times n_s$ matrix. The singular values of $X$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d \ge 0$, are the diagonal entries of $\Sigma$ (recall that $n_s \ge d$).
Note that the diagonalization of $C$ is a direct consequence of the SVD:

$$C = X X^T = U \Sigma V^T V \Sigma^T U^T = U (\Sigma \Sigma^T) U^T = U \Lambda U^T.$$

Thus, the eigenvalues of $C$ are precisely the squared singular values of $X$, that is, $\lambda_i = \sigma_i^2$, for $i = 1, \ldots, d$.
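The identity $\lambda_i = \sigma_i^2$ is easy to check numerically (a sketch of ours, on random data):

```python
import numpy as np

rng = np.random.default_rng(2)
d, ns = 4, 50
X = rng.normal(size=(d, ns))
X = X - X.mean(axis=1, keepdims=True)

# Eigenvalues of C = X X^T (sorted in descending order) ...
lam = np.linalg.eigvalsh(X @ X.T)[::-1]
# ... equal the squared singular values of X.
sig = np.linalg.svd(X, compute_uv=False)
assert np.allclose(lam, sig**2)
```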

2.4 Permuting d and n_s
The $n_s \times n_s$ matrix equivalent to $C$ obtained by taking $X^T$ as input matrix (organizing the data by rows instead of columns) is

$$G = X^T X. \qquad (8)$$

It is denoted as the Gram matrix because it accounts for all the scalar products of the samples. This perspective of the problem aims at a reduction of the number of samples $n_s$, while keeping the representativity of the family. This methodology is often named Proper Orthogonal Decomposition (POD): it coincides with PCA, but aims at reducing the number of samples $n_s$ instead of the dimension $d$.
However, interchanging $X$ and $X^T$ is not relevant when using the SVD. Namely, the SVD (7) reads

$$X^T = V \Sigma^T U^T$$

and directly provides a diagonalization of the large ($n_s \times n_s$) Gram matrix,

$$G = X^T X = V (\Sigma^T \Sigma) V^T = V \tilde{\Lambda} V^T, \qquad (9)$$

where the diagonal $n_s \times n_s$ matrix $\tilde{\Lambda} = \Sigma^T \Sigma$ has the same nonzero entries $\lambda_i$, for $i = 1, \ldots, d$, as $\Lambda$. The dimensionality reduction is performed exactly as described in section 2.2, allowing the dimension of the space of the samples to be reduced from $n_s$ to $k$. That is, any of the $n_s$ samples (or any other vector pertaining to the same family) is described with enough accuracy as a linear combination of $k$ vectors, which are precisely the first $k$ columns of matrix $V$.
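The spectral equivalence of the two viewpoints can be verified directly: the $n_s \times n_s$ Gram matrix has the same $d$ nonzero eigenvalues as the $d \times d$ covariance matrix, the remaining $n_s - d$ being zero (our own check, on random data):

```python
import numpy as np

rng = np.random.default_rng(3)
d, ns = 3, 8
X = rng.normal(size=(d, ns))
X = X - X.mean(axis=1, keepdims=True)

lam_C = np.linalg.eigvalsh(X @ X.T)[::-1]   # d eigenvalues of C
lam_G = np.linalg.eigvalsh(X.T @ X)[::-1]   # ns eigenvalues of G

# First d eigenvalues coincide; the remaining ns - d vanish.
assert np.allclose(lam_G[:d], lam_C)
assert np.allclose(lam_G[d:], 0.0, atol=1e-10)
```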

2.5 The equivalence of diagonalizing C and G
When diagonalizing the covariance matrix $C$ as indicated in (2), the columns of the transformation matrix $U$ are precisely the eigenvectors of $C$, that is,

$$C u_i = \lambda_i u_i, \quad \text{for } i = 1, \ldots, d. \qquad (10)$$

The same happens with the Gram matrix $G$ and matrix $V$ in (9), namely

$$G v_i = \lambda_i v_i, \quad \text{for } i = 1, \ldots, n_s, \qquad (11)$$

being $v_i$ the $i$-th column of $V$. It is worth noting that, for $i = d+1, \ldots, n_s$, the columns $v_i$ of matrix $V$ correspond to the zero eigenvalue (they describe the null space of the matrix).
The fact that the SVD described in section 2.3 provides both $U$ and $V$ in one shot suggests that the computational effort of diagonalizing $C$ (and therefore obtaining $U$, see (2)) is equivalent to the cost of diagonalizing $G$ (and obtaining $V$). This is not obvious at first sight, because the size of $C$ is $d \times d$ while the size of $G$ is $n_s \times n_s$, hence much larger.
This equivalence actually holds: computing the vectors $u_i$, for $i = 1, \ldots, d$, is equivalent to computing the vectors $v_i$. This idea is one of the cornerstones of the kernel Principal Component Analysis (kPCA) method.
Recalling (1), isolating $u_j$ from the right-hand side of (10), and using Appendix A, yields

$$u_j = \frac{1}{\lambda_j} X X^T u_j = \frac{1}{\lambda_j} \sum_{\ell=1}^{n_s} (x_\ell^T u_j)\, x_\ell = X b_j, \qquad (12)$$

where, assuming that $U$ is available, the $n_s \times d$ matrix $B = [b_1\; b_2\; \cdots\; b_d]$ is introduced such that the entry with indices $\ell, j$, for $\ell = 1, \ldots, n_s$ and $j = 1, \ldots, d$, reads

$$[B]_{\ell j} = \frac{1}{\lambda_j}\, x_\ell^T u_j. \qquad (13)$$

Matrix $B$ is computable and coincides with the first $d$ columns of matrix $V$, up to the normalization of its columns (note that $\|b_j\| = \lambda_j^{-1/2}$), see Appendix B; that is,

$$B = [v_1\; v_2\; \cdots\; v_d]\, \hat{\Lambda}^{-1/2}, \qquad (14)$$

being $\hat{\Lambda}$ the $d \times d$ diagonal block of $\Lambda$. The reduced variable $z^\star$ of dimension $k$ is introduced in (4). For the samples in the training set, that is, for $x_j$, $j = 1, \ldots, n_s$, the $i$-th component of the reduced-dimensional samples reads

$$[z^\star_j]_i = u_i^T x_j = \lambda_i^{-1/2}\, v_i^T X^T x_j = \lambda_i^{-1/2}\, v_i^T g_j, \qquad (15)$$

for $i = 1, \ldots, k \le d$, being $g_j$ the $j$-th column of $G$. That is, the compact expression for the construction of the $k \times n_s$ matrix $Z^\star$ of samples in the reduced space reads

$$Z^\star = (\Lambda^\star)^{-1/2} (V^\star)^T G, \qquad (16)$$

where $V^\star$ collects the first $k$ columns of $V$ and $\Lambda^\star$ the corresponding $k \times k$ diagonal block of eigenvalues. The backward mapping described in (6) is also written in terms of $V^\star$, namely

$$X \approx U^\star Z^\star = X V^\star (\Lambda^\star)^{-1/2} Z^\star. \qquad (17)$$

Equation (17) stands because a straightforward rearrangement of (12), taking into account (14), shows that $U^\star = X V^\star (\Lambda^\star)^{-1/2}$.
Note that, when compared with equation (6), in (17) the unknown $X$ appears also in the right-hand side when it has to be expressed in terms of $V$ (when $U$ is not available, as is the case in section 4). This is one of the difficulties to overcome in order to devise a backward mapping strategy for kPCA.
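The construction of the reduced samples purely from $V$ and $G$ (never forming $C$ or $U$ explicitly) can be sketched as follows; the names and the synthetic data are ours, and the reference computation uses the classical PCA mapping for comparison:

```python
import numpy as np

rng = np.random.default_rng(4)
d, ns, k = 3, 10, 2
X = rng.normal(size=(d, ns))
X = X - X.mean(axis=1, keepdims=True)

# Diagonalize the Gram matrix G = X^T X only.
G = X.T @ X
lam, V = np.linalg.eigh(G)
lam, V = lam[::-1], V[:, ::-1]
Vk, lam_k = V[:, :k], lam[:k]

# Forward mapping written purely in terms of V and G.
Z_from_G = np.diag(lam_k**-0.5) @ Vk.T @ G

# Reference: classical PCA forward mapping Z = U^T X, with U = X V Lambda^{-1/2}.
Ustar = X @ Vk @ np.diag(lam_k**-0.5)
Z_ref = Ustar.T @ X

assert np.allclose(Ustar.T @ Ustar, np.eye(k))   # recovered basis is orthonormal
assert np.allclose(Z_from_G, Z_ref)
```

The two assertions check, respectively, that the basis recovered from the eigenvectors of $G$ is orthonormal and that both routes yield the same reduced samples.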

3 Kernel Principal Component Analysis (kPCA)
As shown above, PCA is a dimensionality reduction technique that, for some $d$-dimensional variable $x$, discovers linear manifolds of dimension $k \ll d$ where $x$ lies (within some tolerance $\varepsilon$). In many applications, reality is not so simple, and the low-dimensional manifold characterizing the variable is nonlinear. In these cases, PCA fails to identify a possible dimensionality reduction. Several techniques allow achieving nonlinear dimensionality reduction, see Bishop (2006) and references therein. Among them, kernel Principal Component Analysis (kPCA) Schölkopf et al. (1998b); Rathi et al. (2006) is appreciated for its simplicity and easy implementation.

3.1 Transformation to a higher dimension D
The kPCA conceptual idea consists in introducing an arbitrary transformation $\Phi$ from $\mathbb{R}^d$ to $\mathbb{R}^D$, for some very large dimension $D \gg d$. This transformation into a large-dimensional space is expected to untangle (or flatten) the nonlinear manifold where $x$ belongs. That is, $\Phi$ and $D$ are expected to be such that the transformed variable $x' = \Phi(x)$, mapped into $\mathbb{R}^D$, lies in a linear low-dimensional manifold that is readily identified by PCA. In other words, PCA is to be applied to the $D \times n_s$ matrix containing the transformed samples,

$$X' = [\Phi(x_1)\; \Phi(x_2)\; \cdots\; \Phi(x_{n_s})]. \qquad (19)$$

Assuming that $\Phi$ (and therefore $D$) is known, the standard version of PCA is applied: it diagonalizes the $D \times D$ covariance matrix $C' = X' X'^T$, as described in section 2. This introduces two obvious difficulties: first, $\Phi$ is unknown and, second, the dimension $D$ required to properly untangle the manifold is typically very large (in particular, much larger than $n_s$), and consequently the computational effort to diagonalize $C'$ is unaffordable.
Note, however, that the transformed Gram matrix $G' = X'^T X'$ is of size $n_s \times n_s$ (the same as $G$). Moreover, as shown in section 2.4, diagonalizing $G'$ reduces the dimension in the same way as diagonalizing $C'$.
Thus, in practice, the transformation $\Phi$ is mainly required to compute $G'$.

3.2 The kernel trick
As noted above, the a priori unknown $\Phi$ is expected to map the nonlinear manifold into a linear one in a higher-dimensional space. In principle, it is not obvious how to determine a proper $\Phi$ that aligns the samples into a linear subspace of $\mathbb{R}^D$. The kernel trick consists in defining $G'$ directly. Note that the generic term of matrix $G'$ reads

$$[G']_{ij} = \Phi(x_i)^T \Phi(x_j), \qquad (20)$$

for $i, j = 1, \ldots, n_s$. The kernel idea is to introduce, instead of the function $\Phi(\cdot)$, some bivariate symmetric form $\kappa(\cdot, \cdot)$ such that

$$[G']_{ij} = \kappa(x_i, x_j). \qquad (21)$$

Comparing (20) and (21) reveals that selecting some $\kappa(\cdot, \cdot)$ is, for all practical purposes, equivalent to selecting some $\Phi(\cdot)$. A typical choice for the kernel is the Gaussian, which reads

$$\kappa(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \qquad (22)$$

for some scale parameter $\sigma$. Any other choice is valid provided that the matrix $G'$ generated with the sample is symmetric and positive semidefinite (recall that it must be a Gram matrix). Many other alternative kernels are proposed in the literature Schölkopf et al. (1998b); Rathi et al. (2006); Twining and Taylor (2001); Mika et al. (1999). The idea is to select the kernel providing the best dimensionality reduction (the lowest value of $k$).
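A Gaussian kernel matrix of the kind in (22) can be built and sanity-checked as follows. This is a sketch of ours (the helper `gaussian_gram` and the width parametrization with `sigma` are our conventions):

```python
import numpy as np

def gaussian_gram(X, sigma):
    """G'_{ij} = exp(-||x_i - x_j||^2 / (2 sigma^2)) for column samples of X."""
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    return np.exp(-np.maximum(D2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(5)
X = rng.normal(size=(2, 6))
G = gaussian_gram(X, sigma=1.0)

assert np.allclose(G, G.T)                       # symmetric
assert np.allclose(np.diag(G), 1.0)              # unit diagonal
assert np.all(np.linalg.eigvalsh(G) > -1e-12)    # positive (semi)definite
```

The `np.maximum(D2, 0.0)` guard only removes tiny negative round-off in the pairwise distances.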

3.3 Centering the kernel
As mentioned above, the samples have to be centered to properly use PCA. Note, however, that selecting some arbitrary $\Phi(\cdot)$ does not guarantee that the transformed sample remains centered. If $\Phi(\cdot)$ were available, the operation to center $X'$ in (19) would be straightforward. Similarly, selecting $\kappa(\cdot, \cdot)$ produces a matrix $G'$ which is, in general, not centered. If $\Phi$ were known, centering the transformed sample would require redefining $x'_i$ (changing $x'_i$ into $\bar{x}'_i$) as

$$\bar{x}'_i = x'_i - \frac{1}{n_s} \sum_{\ell=1}^{n_s} x'_\ell. \qquad (23)$$

In the following, the centered magnitudes are denoted with an overbar, like $\bar{x}'_i$ in (23). Thus, the generic term of the centered Gram matrix, analogously to (20), is $[\bar{G}']_{ij} = \bar{x}_i'^T \bar{x}'_j$. Recalling (21), this matrix results in

$$\bar{G}' = G' - \frac{1}{n_s}\left(\mathbf{1}_{[n_s \times n_s]}\, G' + G'\, \mathbf{1}_{[n_s \times n_s]}\right) + \frac{1}{n_s^2}\, \mathbf{1}_{[n_s \times n_s]}\, G'\, \mathbf{1}_{[n_s \times n_s]}, \qquad (24)$$

being $\mathbf{1}_{[n_s \times n_s]} \in \mathbb{R}^{n_s \times n_s}$ the $n_s \times n_s$ matrix having all its entries equal to one.
Note that expression (24) provides the centered Gram matrix $\bar{G}'$ after a simple algebraic manipulation of the matrix $G'$ directly generated by the kernel $\kappa(\cdot, \cdot)$. Accordingly, the column $\bar{g}'_j$ of matrix $\bar{G}'$ is given by

$$\bar{g}'_j = g'_j - \frac{1}{n_s}\, \mathbf{1}_{[n_s \times n_s]}\, g'_j - \frac{1}{n_s}\, G'\, \mathbf{1}_{[n_s]} + \frac{1}{n_s^2} \left(\mathbf{1}_{[n_s]}^T G'\, \mathbf{1}_{[n_s]}\right) \mathbf{1}_{[n_s]}, \qquad (25)$$

being $g'_j$ the $j$-th column of $G'$ and $\mathbf{1}_{[n_s]} \in \mathbb{R}^{n_s}$ the vector with all entries equal to one.
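The centering manipulation of the Gram matrix can be verified against explicitly centered features when $\Phi$ is known. In this sketch of ours we take the linear kernel ($\Phi$ equal to the identity), so both routes are computable and must agree:

```python
import numpy as np

rng = np.random.default_rng(6)
d, ns = 2, 7
X = rng.normal(size=(d, ns))

# Linear kernel: Phi = identity, hence G' = X^T X (uncentered).
G = X.T @ X

# Centering by algebraic manipulation of G' only:
# Gbar = (I - 1/ns) G (I - 1/ns), with 1 the ns x ns matrix of ones.
O = np.full((ns, ns), 1.0 / ns)
I = np.eye(ns)
Gbar = (I - O) @ G @ (I - O)

# Same result via explicitly centered features (possible only because Phi is known):
Xc = X - X.mean(axis=1, keepdims=True)
assert np.allclose(Gbar, Xc.T @ Xc)
```

A useful by-product: every row and column of the centered Gram matrix sums to zero.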

3.4 Diagonalization and dimensionality reduction
Recall that PCA is to be applied to the transformed sample $X'$. However, $C'$ is not available: only $G'$ and its centered version $\bar{G}'$ are computable. Thus, the idea is to use POD (diagonalize $\bar{G}'$, as shown in section 2.4) and invoke the idea of section 2.5. The diagonalization of $\bar{G}'$ yields

$$\bar{G}' = V' \Lambda' V'^T. \qquad (26)$$

The eigenvalues in $\Lambda'$, $\lambda'_1 \ge \lambda'_2 \ge \cdots \ge \lambda'_{n_s} \ge 0$, are expected to decrease such that the first $k$ (with $k \ll \min(d, n_s)$) collect a significant amount of the full variance, with a criterion associated with some small tolerance $\varepsilon$, see (3). Accordingly, the reduced version of $V'$ is $V'^\star$, of size $n_s \times k$ (taking the first $k$ columns of $V'$). The dimensionality reduction of the samples is performed as shown in section 2.5. That is, the $k \times n_s$ matrix of samples in the reduced space, $Z'^\star$, is computed as

$$Z'^\star = (\Lambda'^\star)^{-1/2} (V'^\star)^T \bar{G}'. \qquad (27)$$

This is equivalent to computing each vector $z'_j$ (the $j$-th sample $x_j$ mapped into the reduced space) as

$$z'_j = (\Lambda'^\star)^{-1/2} (V'^\star)^T \bar{g}'_j, \qquad (28)$$

which is precisely the forward mapping into the reduced space $\mathbb{R}^k$ of the sample $x_j$.

The question arises of how to map a new element $x_{\text{new}} \in \mathbb{R}^d$ which does not belong to the initial training set. This mapping is described in both (4) and (15) for the linear case (PCA). In the nonlinear case, the forward mapping follows the idea of (28) with a vector $g'_{\text{new}} \in \mathbb{R}^{n_s}$ defined from the kernel and the elements of the sample (the training set). Vector $g'_{\text{new}}$ is such that its $i$-th component reads

$$[g'_{\text{new}}]_i = \kappa(x_i, x_{\text{new}}). \qquad (29)$$

Vector $g'_{\text{new}}$ is centered by replacing $g'_j$ by $g'_{\text{new}}$ in expression (25) to obtain $\bar{g}'_{\text{new}}$. The mapping of $x_{\text{new}}$ into the reduced space is finally obtained as

$$z'_{\text{new}} = (\Lambda'^\star)^{-1/2} (V'^\star)^T \bar{g}'_{\text{new}}. \qquad (30)$$

Figure 1: Illustration of the dimensionality reduction using kPCA.

Figure 1 shows how samples in the input space, $x_j \in \mathbb{R}^d$, are mapped forward into $z'_j \in \mathbb{R}^k$ in the reduced space. This is ideally performed by passing through the feature space, using PCA to reduce the dimensionality of the samples $\Phi(x_j) \in \mathbb{R}^D$, and then projecting them into the reduced space $\mathbb{R}^k$ (grey dotted arrows in Figure 1). In practice, the feature space is never used.
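The whole forward pipeline, including the mapping of a new sample, can be sketched as follows. This is a self-contained NumPy illustration with our own names; the Gaussian width ($\sigma = 1$) and the synthetic data are arbitrary:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Kernel values kappa(a_i, b_j) for column samples of A (d x m) and B (d x n).
    sqA = np.sum(A**2, axis=0)[:, None]
    sqB = np.sum(B**2, axis=0)[None, :]
    return np.exp(-(sqA + sqB - 2.0 * A.T @ B) / (2.0 * sigma**2))

rng = np.random.default_rng(7)
d, ns, k = 2, 20, 2
X = rng.normal(size=(d, ns))

# Kernel matrix and its centered version.
G = gaussian_gram(X, X)
H = np.eye(ns) - np.full((ns, ns), 1.0 / ns)
Gbar = H @ G @ H

# Diagonalize the centered Gram matrix and keep the first k eigenpairs.
lam, V = np.linalg.eigh(Gbar)
lam, V = lam[::-1], V[:, ::-1]
Vk, lam_k = V[:, :k], lam[:k]

# Reduced samples (k x ns).
Z = np.diag(lam_k**-0.5) @ Vk.T @ Gbar

# Forward mapping of a new sample x_new: kernel vector, centering, projection.
x_new = rng.normal(size=(d, 1))
g_new = gaussian_gram(X, x_new)                           # ns x 1
g_bar = g_new - G.mean(axis=1, keepdims=True) \
        - g_new.mean() + G.mean()                         # centering of g_new
z_new = np.diag(lam_k**-0.5) @ Vk.T @ g_bar               # k x 1
```

Feeding a training sample through the `x_new` path reproduces the corresponding column of `Z`, which is a convenient consistency check.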
With kPCA, the kernel trick introduces an alternative strategy (with no explicit definition of $\Phi$) that directly defines a forward mapping from $x_j \in \mathbb{R}^d$ to $z'_j \in \mathbb{R}^k$ (represented by the blue dashed arrow in Figure 1).

4 kPCA backward mapping: from z to x

As noted in section 2.5, having only $V'$ does not allow properly mapping backwards an element $z' \in \mathbb{R}^k$ into its pre-image $x \in \mathbb{R}^d$ (red dashed arrow in Figure 1).
The typical strategy to perform the backward mapping consists in seeking the element $x^\star$ as a linear combination (or weighted average) of the elements of the training set, that is,

$$x^\star = \sum_{j=1}^{n_s} w_j\, x_j, \qquad (31)$$

for some weights $w_j$, $j = 1, \ldots, n_s$. Note that the weights have to be computed as a function of the available data, that is, $z'$ and the samples in the training set. Often, the weights are explicitly computed in terms of the distances from $z'$ to all the samples mapped forward into the reduced space. That is, the distances $d_i = \|z' - z'_i\|$ are computed in the reduced space and the weights $w_i$ are explicitly given as an inverse function of $d_i$ (the larger the distance, the lower the weight). Of course, this introduces some arbitrariness in the definition of the weights. There are many functions that decrease with the distance $d$: $w = 1/d$; $w = 1/d^2$; $w = \exp(-d)$... They all provide sensible results, but with non-negligible discrepancies.
Here, we propose to select the weights with an objective criterion, independent of any arbitrary choice. The idea is to adopt the form of (31) and define the vector of unknown weights $w = [w_1 \ldots w_{n_s}]^T$. Then, the vector $g'$ computed following (29) becomes a function of the unknown weights,

$$[g'(w)]_i = \kappa\!\left(x_i,\; \sum_{j=1}^{n_s} w_j\, x_j\right). \qquad (32)$$

Vector $g'(w)$ is centered using (25) to become $\bar{g}'(w)$. Then, for a given value of $z'$, the following discrepancy functional is introduced:

$$J(w) = \left\| z' - (\Lambda'^\star)^{-1/2} (V'^\star)^T \bar{g}'(w) \right\|^2. \qquad (33)$$

Finally, the weights $w^\star$ are selected such that they minimize the discrepancy functional,

$$w^\star = \arg\min_{w \in \mathbb{R}^{n_s}} J(w). \qquad (34)$$
The minimization is performed with any standard optimization algorithm Díez (2003); Ciarlet (1989). In the example shown in the next section, the minimization method used is fmincon, an interior-point algorithm implemented in Matlab, simply imposing the weights to be nonnegative, that is, $w_j \ge 0$.
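The paper's example uses Matlab's fmincon; an equivalent sketch in Python can rely on `scipy.optimize.minimize` with nonnegativity bounds (L-BFGS-B). Everything below — the 2D synthetic setup, the helper names `forward` and `J`, and the choice of optimizer — is our own illustration of the minimization idea, not the paper's code:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(A, B, sigma=1.0):
    sqA = np.sum(A**2, axis=0)[:, None]
    sqB = np.sum(B**2, axis=0)[None, :]
    return np.exp(-(sqA + sqB - 2.0 * A.T @ B) / (2.0 * sigma**2))

rng = np.random.default_rng(8)
d, ns, k = 2, 15, 2
X = rng.normal(size=(d, ns))

# Forward kPCA machinery (as in section 3).
G = gaussian_gram(X, X)
H = np.eye(ns) - np.full((ns, ns), 1.0 / ns)
Gbar = H @ G @ H
lam, V = np.linalg.eigh(Gbar)
lam, V = lam[::-1], V[:, ::-1]
Vk, lam_k = V[:, :k], lam[:k]
P = np.diag(lam_k**-0.5) @ Vk.T           # maps centered kernel vectors to z

def forward(x):
    g = gaussian_gram(X, x.reshape(-1, 1))
    gbar = g - G.mean(axis=1, keepdims=True) - g.mean() + G.mean()
    return (P @ gbar).ravel()

# Target: the reduced image of a point held out of the training set.
x_true = X[:, 0] + 0.05 * rng.normal(size=d)
z_target = forward(x_true)

# Discrepancy functional J(w) = || z_target - forward(X w) ||^2, with w >= 0.
def J(w):
    return np.sum((z_target - forward(X @ w))**2)

w0 = np.full(ns, 1.0 / ns)                # start from uniform weights
res = minimize(J, w0, bounds=[(0.0, None)] * ns, method="L-BFGS-B")
x_rec = X @ res.x                         # reconstructed pre-image
```

The optimizer drives the residual well below its initial value; since the objective is nonconvex, a good starting point (e.g. inverse-distance weights) can help in practice.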
In the case of the Gaussian kernel given in (22), recalling that $V'$ is an orthogonal matrix (premultiplying by it does not alter the norm) and that the logarithm is monotonic, the discrepancy functional (33) is replaced by an alternative form, $\tilde{J}(w)$, that measures the discrepancy of the logarithms of the kernel values. Note that the functionals $J$ in (33) and $\tilde{J}$ in (35) are different, but both lead to solutions minimizing the discrepancy between model and data.
Using the expression of the Gaussian kernel, (32) becomes

$$[g'(w)]_i = \exp\!\left(-\frac{\left\| x_i - \sum_{j=1}^{n_s} w_j\, x_j \right\|^2}{2\sigma^2}\right), \qquad (36)$$

and therefore the discrepancy functional $\tilde{J}(w)$ adopts an explicit form in terms of the unknown weights $w$. In order to ease the computational effort of the minimization algorithm, it is convenient to reduce the number of unknown parameters using some criterion based on the distances $d_j$. For instance, take as actual unknowns only the weights that correspond to samples whose distance in the reduced space is lower than some threshold. This may drastically reduce the dimension of the minimization problem from $n_s$ to the number of samples considered close enough to $z'$ (and hence to $x^\star$).

5 Numerical test
The backward mapping proposed above (pre-image reconstruction) is tested in an example illustrated in Figure 2. The sample (training set) contains $n_s = 69$ frames of a video of an open hand that closes and opens once (nine of them are shown in Figure 2). The number of grey-scale pixel values in each frame is $d = 1080 \times 1920 = 2073600$. An extra sample $x_0$ is kept apart to check the backward and forward mappings. Picture $x_0$ is a frame of the video outside the training set of $n_s$ samples, located at the center of the running time. Intuitively, this set of pictures can be described by a single parameter (the opening of the hand), and therefore the nonlinear manifold where the sample belongs is expected to be of dimension one.
The kPCA methodology described in section 3, associated with the Gaussian kernel, see (22), provides a reduced model of $k = 3$ variables that collects approximately 50% of the variance (corresponding to $\varepsilon = 0.5$). This low level of accuracy is preferred in the example ($\varepsilon = 0.5$ is too large for practical purposes) because $k = 3$ allows a graphical illustration. Indeed, Figure 3 shows (in blue) the representation of the reduced samples $z'_j$, $j = 1, \ldots, 69$. Note that the configuration of the samples in the 3D space suggests that the actual intrinsic dimension of the manifold is one. The image $x_0$ is mapped into the 3D space using the strategy devised in section 3.4 (see the red dot in Figure 3).
The new sample in the reduced space is mapped backward to the input space $\mathbb{R}^d$ following the ideas in section 4. This produces the pre-image $x_0^\star$, which is an approximation to $x_0$, see Figure 4. Note that the dimensionality reduction (from 2073600 degrees of freedom to 3) preserves most of the features of the image.

Appendices

A Three-vector product rule
Throughout the paper, the following product rule is extensively used. For three arbitrary vectors $a$, $b$ and $c$ in $\mathbb{R}^d$, it holds that

$$(a b^T)\, c = a\, (b^T c).$$

This is easily shown using index notation, $[(a b^T) c]_i = a_i b_j c_j = a_i (b_j c_j)$, where the Einstein summation convention is used.
B The first d columns of V are equal to B

Proposition 1. The $n_s \times d$ matrix $B$ introduced in (13) coincides, up to a normalization of its columns, with the first $d$ columns of matrix $V$.
Proof. The $n_s \times d$ matrix $B$ has entries

$$[B]_{\ell j} = \frac{1}{\lambda_j}\, x_\ell^T u_j,$$

for $\ell = 1, \ldots, n_s$ and $j = 1, \ldots, d$; that is, its $j$-th column is $b_j = \frac{1}{\lambda_j} X^T u_j$. As shown in (12), $u_j = X b_j$. Thus, recalling that $C = X X^T$, the eigenvalue problem (10) results in

$$G b_j = \frac{1}{\lambda_j}\, X^T X X^T u_j = \frac{1}{\lambda_j}\, X^T C u_j = X^T u_j = \lambda_j b_j.$$

The previous expression is written in matrix form as

$$G B = B \hat{\Lambda},$$

being $\hat{\Lambda}$ the $d \times d$ diagonal block of $\Lambda$. This proves that the $d$ columns of $B$ are indeed eigenvectors of $G$ associated with the eigenvalues $\lambda_i$, for $i = 1, \ldots, d$, and therefore they coincide with the first $d$ columns of $V$ (possibly up to a normalization constant; indeed, $\|b_j\|^2 = \frac{1}{\lambda_j^2}\, u_j^T X X^T u_j = \frac{1}{\lambda_j}$).