Gradient-Based Training of Gaussian Mixture Models for High-Dimensional Streaming Data

We present an approach for efficiently training Gaussian Mixture Models (GMMs) by Stochastic Gradient Descent (SGD) on non-stationary, high-dimensional streaming data. Our training scheme does not require data-driven parameter initialization (e.g., by k-means) and can thus start from a random initial state. Furthermore, the approach allows mini-batch sizes as low as 1, which are typical for streaming-data settings. Major problems in such settings are undesirable local optima during early training phases and numerical instabilities due to high data dimensionalities. We introduce an adaptive annealing procedure to address the first problem, whereas numerical instabilities are eliminated by an exponential-free approximation to the standard GMM log-likelihood. Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained completely without, for instance, k-means based centroid initialization. It also compares favorably to an online variant of Expectation-Maximization (EM), stochastic EM (sEM), which it outperforms by a large margin for very high-dimensional data.


Motivation
Intrinsically, EM is a batch-type algorithm, so memory requirements can become excessive for large datasets. In addition, streaming-data scenarios require data samples to be processed one by one, which is impossible for a batch-type algorithm. Moreover, data statistics may be subject to changes over time (concept drift/shift), to which the GMM should adapt. In such scenarios, an online, mini-batch type of optimization such as SGD is attractive, since it can process samples one by one, has modest and fixed memory requirements, and can adapt to changing data statistics.

Related Work
Online EM is a technique for performing EM mini-batch-wise, allowing large datasets to be processed. One branch of previous research [20,16,4] has been devoted to the development of stochastic Expectation-Maximization (sEM) algorithms that reduce to the original EM method in the limit of large batch sizes. The variant presented in [2] is widely used due to its simplicity and efficiency for large datasets. Such approaches come at the price of additional hyper-parameters (e.g., step size, mini-batch size, step size reduction), thus removing a key advantage of EM over SGD. Another approach is to modify the EM algorithm itself by, e.g., including heuristics for adding, splitting and merging centroids [30,10,24,3,26,15,29]. This allows GMM-like models to be trained by presenting one sample after another. These models work well in several application scenarios, but their learning dynamics are impossible to analyze mathematically, and they introduce a high number of parameters. Apart from these works, some authors avoid the issue of large datasets by determining smaller "core sets" of representative samples and performing vanilla EM on them [11]. SGD for training GMMs has, as far as we know, been treated only recently, in [13,14]. In this body of work, GMM constraint enforcement is ensured by manifold optimization techniques and re-parameterization/regularization, thereby introducing additional hyper-parameters. The issue of local optima is side-stepped by a k-means type centroid initialization, and the datasets used are low-dimensional (36 dimensions). Annealing and approximation approaches for GMMs were proposed in [28,23,22,8]. However, the regularizers proposed in [28,22] differ significantly from our scheme. GMM log-likelihood approximations similar to the one used here are discussed in, e.g., [23] and [8], but only in combination with EM training. A similar "hard assignment" approximation is performed in [27].
GMM training in high-dimensional spaces is discussed in several publications. A conceptually very interesting procedure is proposed in [12]; it exploits the properties of high-dimensional spaces to achieve learning with a number of samples that is polynomial in the number of Gaussian components. This is difficult to apply in streaming settings, since higher-order moments need to be estimated beforehand and the number of samples is usually unknown. Training GMM-like lower-dimensional factor analysis models by SGD on high-dimensional image data is successfully demonstrated in [25]; the authors avoid numerical issues but, again, sidestep the local-optima issue by using k-means initialization. The numerical issues associated with log-likelihood computation in high-dimensional spaces are generally mitigated by the "logsumexp" trick [21], which is, however, insufficient to ensure numerical stability for particularly high-dimensional data such as images.

Goals and Contributions
The goal of this article is to establish GMM training by SGD as a simple and scalable alternative to sEM in streaming scenarios with potentially high-dimensional data. The main novel contributions are: a proposal for numerically stable GMM training by SGD that outperforms sEM for high data dimensionalities; an automatic annealing procedure that ensures SGD convergence without prior knowledge of the data (no k-means initialization), which is beneficial for streaming data; a computationally efficient method for enforcing all GMM constraints in SGD; and a convergence proof for the annealing procedure.
Additionally, we provide a TensorFlow implementation.

Gaussian Mixture Models
GMMs are probabilistic models that aim to explain the observed data X = {x_n} by expressing their density as a weighted mixture of K Gaussian component densities N(x; mu_k, P_k) ≡ N_k(x):

p(x_n) = sum_k pi_k N_k(x_n).

We parameterize the Gaussian densities by precision matrices P_k = Sigma_k^-1 instead of covariances Sigma_k. GMM training optimizes the (incomplete) log-likelihood

L = (1/N) sum_n log sum_k pi_k N_k(x_n). (1)

GMM Constraint Enforcement for SGD
GMMs require the mixture weights to be normalized, sum_k pi_k = 1, and the precision matrices to be positive definite: x^T P_k x > 0 for all x != 0. These constraints must be explicitly enforced after each SGD step. Weights pi_k are adapted according to [13], which replaces them by free parameters xi_k from which the pi_k are computed so that normalization is ensured:

pi_k = exp(xi_k) / sum_j exp(xi_j).

Diagonal precision matrices are re-parameterized as P_k = D_k^2, with diagonal matrices D_k (a Cholesky-type decomposition), and are therefore guaranteed to be positive definite; moreover, det Sigma_k = det P_k^-1 can be computed efficiently from the diagonal entries of D_k. Since we are dealing with high-dimensional data, precision matrices are always taken to be diagonal, as full matrices would be prohibitive w.r.t. memory consumption and the number of free parameters. Full precision matrices are treated here for completeness' sake only. We represent them by a spectral decomposition into eigenvectors v_i and eigenvalues lambda_i^2: P_k = sum_i lambda_i^2 v_i v_i^T, which ensures positive-definiteness; it also gives det Sigma_k = det P_k^-1 = prod_i lambda_i^-2. In order to maintain a correct representation, the eigenvectors have to be orthonormalized after each SGD step.
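The two enforcement steps can be sketched as follows; this is a minimal numpy sketch with illustrative helper names, not the authors' TensorFlow code:

```python
import numpy as np

def mixture_weights(xi):
    """Free parameters xi -> positive, normalized weights (softmax-style)."""
    e = np.exp(xi - xi.max())      # shift by the max for numerical stability
    return e / e.sum()

def diagonal_precision(D, d_max=20.0):
    """Diagonal precision P_k = D_k^2: non-negative by construction;
    clipping keeps entries in [0, d_max^2]."""
    return np.clip(D ** 2, 0.0, d_max ** 2)

xi = np.array([0.0, 1.0, -2.0])
pi = mixture_weights(xi)                         # sums to 1, all entries > 0
P = diagonal_precision(np.array([0.5, 30.0, 2.0]))
```

Because both constraints hold by construction, the SGD updates themselves can remain entirely unconstrained.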

Max-Component Approximation for GMM
The log-likelihood eq. (1) is difficult to optimize by SGD due to numerical problems (mainly underflows and resulting divisions by zero) for high data dimensionalities. This is why we seek a lower bound that we can optimize instead. A simple scheme is given by

L_hat = (1/N) sum_n max_k log(pi_k N_k(x_n)) = (1/N) sum_n log(pi_k* N_k*(x_n)), where k* = argmax_k pi_k N_k(x_n),

which we call the max-component approximation of eq. (3). In contrast to the lower bound constructed for EM-type algorithms, our bound is usually not tight. Nevertheless, we will demonstrate later that it is a very good approximation when data are high-dimensional. The advantage of L_hat is the elimination of the exponentials that cause numerical instabilities. The "logsumexp" trick is normally employed with GMMs to rectify this by factoring out the largest component probability N_k*. This mitigates but does not avoid numerical problems when distances are large, a common occurrence for high data dimensions. To give an example: normalizing the component probability N_k = e^-101 by the highest probability N_k* = e^3 yields N_k / N_k* = e^-104, which produces an underflow in 32-bit floating point.
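The underflow in the example above is easy to reproduce; the following numpy sketch uses illustrative log-domain scores and contrasts the logsumexp normalization with the exponential-free max-component value:

```python
import numpy as np

# Log-domain component scores log(pi_k N_k); large negative values are
# typical for high-dimensional data (values here are illustrative).
log_p = np.array([-5000.0, -5104.0], dtype=np.float32)

ratios = np.exp(log_p - log_p.max())   # logsumexp trick: e^0 and e^-104
# e^-104 is below the smallest float32 subnormal and underflows to 0.0:
print(ratios[1])                        # -> 0.0

# The max-component approximation avoids exponentials entirely:
L_hat = log_p.max()                     # log(pi_k* N_k*), no underflow possible
```

Since the surviving ratio is exactly zero, any later division by it (e.g., in responsibility computations) would produce NaN values; the max-component bound sidesteps this entirely.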

Undesirable Local Optima in SGD Training
A crucial issue when optimizing L_hat (and indeed L as well) by SGD without k-means initialization concerns undesirable local optima. Most notable are the single/sparse-component solutions, see fig. 1. They are characterized by one or several components {k_i} having large weights, with centroids and precision matrices given by the mean and covariance of a significant subset X_{k_i} ⊂ X of the data X, whereas the remaining components k are characterized by pi_k ≈ 0, mu_k = mu(t=0), P_k = P(t=0). Thus, these unconverged components are almost never the Best Matching Unit (BMU) k*. The max-operation in L_hat causes gradients like ∂L_hat/∂mu_k to contain the factor delta_kk*, implying that the gradients are non-zero only for the BMU k*. The gradients of unconverged components therefore vanish, and they remain in their unconverged state.
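This vanishing-gradient effect can be seen in a toy sketch (hypothetical values, diagonal unit precisions, equal weights): only the BMU receives a non-zero centroid gradient.

```python
import numpy as np

x = np.array([0.5, 0.5])
mu = np.array([[0.4, 0.6],     # converged component, close to the data
               [5.0, 5.0]])    # unconverged component, far away
prec = np.ones_like(mu)        # unit diagonal precisions

# log(pi_k N_k) up to k-independent constants
scores = -0.5 * (prec * (x - mu) ** 2).sum(axis=1)
k_star = scores.argmax()       # BMU

# d L_hat / d mu_k = delta_{k,k*} * P_k (x - mu_k): zero unless k == k*
grad_mu = np.zeros_like(mu)
grad_mu[k_star] = prec[k_star] * (x - mu[k_star])
```

The far-away component never becomes the BMU, so its gradient stays exactly zero at every step, which is precisely the sparse-component trap.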

Annealing Procedure for Avoiding Local Optima
Our approach to avoiding sparse-component solutions is to punish their characteristic response patterns by replacing L_hat with the smoothed max-component log-likelihood

L_hat_sigma = (1/N) sum_n max_k sum_j g_kj(sigma) log(pi_j N_j(x_n)), (5)

where g_kj(sigma) is a normalized smoothing filter centered on component k. Regarding its interpretation, we assume that the K GMM components are arranged on a quadratic 2D grid of size sqrt(K) x sqrt(K) (see fig. 2), with filter values given by a periodically continued 2D Gaussian centered on component k. With this interpretation, eq. (5) represents a 2D convolution, with periodic boundary conditions (in the sense used in image processing), of the log(pi_k N_k(x)) with a smoothing filter whose width is controlled by sigma. Thus, eq. (5) is maximized if the log-probabilities follow a uni-modal Gaussian profile of spatial variance ~ sigma^2, which heavily punishes single/sparse-component solutions with their locally delta-like response. A 1D grid for annealing, together with 1D smoothing filters, was verified to fulfill this purpose as well. We chose 2D because it allows for easier visualization while incurring an identical computational cost.

Fig. 2: Visualization of the Gaussian smoothing filters g_k of width sigma used in annealing, for three different values of sigma. The g_k are placed on a 2D grid; darker pixels indicate larger values. Over time, sigma(t) is reduced (middle and right pictures) and the Gaussians approach a delta peak, thus recovering the original, non-annealed loss function. Note that the grid is considered periodic in order to avoid boundary effects, so the g_k are themselves periodic.
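Constructing such a periodic filter is straightforward; the following numpy sketch (function name is illustrative) builds a normalized 2D Gaussian on a wrap-around grid, as used in the smoothing step:

```python
import numpy as np

def periodic_gaussian_filter(K_side, k_star, sigma):
    """Normalized 2D Gaussian of width sigma centered on grid node k_star,
    with periodic (wrap-around) boundary conditions."""
    idx = np.arange(K_side)

    def pdist(i, c):
        # shortest periodic 1D distance on a ring of length K_side
        d = np.abs(i - c)
        return np.minimum(d, K_side - d)

    dy = pdist(idx[:, None], k_star[0])
    dx = pdist(idx[None, :], k_star[1])
    g = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()            # normalize so the filter sums to one

g = periodic_gaussian_filter(8, (0, 0), sigma=2.0)
```

Because the filter is normalized for every sigma, shrinking sigma toward zero turns it into a delta peak and eq. (5) smoothly recovers the plain max-component loss.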
Annealing control regulates the decrease of sigma. This quantity defines an effective upper bound on L_hat_sigma (see section 2.6 for a proof). An implication is that the loss will become stationary once this bound is reached, which we consider a suitable indicator for reducing sigma. We implement an annealing control that sets sigma <- 0.9 sigma whenever the loss is considered sufficiently stationary. Stationarity is detected by maintaining an exponentially smoothed average l(t) = (1 - alpha) l(t-1) + alpha L_hat_sigma(t) on time scale alpha. Every 1/alpha iterations, we compute the fractional increase of L_hat_sigma,

Delta = (l(t) - l(t - 1/alpha)) / |l(t - 1/alpha)|,

and consider the loss stationary iff Delta < delta (the latter being a free parameter). The choice of the time constant for smoothing L_hat_sigma is outlined in the following section.
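The control loop can be sketched as a small stateful helper; this is a hedged sketch of the procedure just described (class and attribute names are ours, not from the released implementation):

```python
import numpy as np

class AnnealingControl:
    """Reduce sigma by 10% whenever the smoothed loss is stationary."""
    def __init__(self, sigma0, alpha=0.001, delta=0.05):
        self.sigma, self.alpha, self.delta = sigma0, alpha, delta
        self.ell = None        # exponentially smoothed loss l(t)
        self.ell_prev = None   # smoothed loss one check interval ago
        self.step = 0

    def update(self, loss):
        self.ell = loss if self.ell is None else \
            (1 - self.alpha) * self.ell + self.alpha * loss
        self.step += 1
        if self.step % int(round(1 / self.alpha)) == 0:  # every 1/alpha iters
            if self.ell_prev is not None:
                frac_increase = (self.ell - self.ell_prev) / abs(self.ell_prev)
                if frac_increase < self.delta:           # loss is stationary
                    self.sigma *= 0.9
            self.ell_prev = self.ell
        return self.sigma
```

Feeding it a stationary loss for two check intervals triggers exactly one reduction, e.g. `AnnealingControl(2.0, alpha=0.1)` updated 20 times with a constant loss returns sigma = 1.8.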

Training Procedure for SGD
Training GMMs by SGD is performed by maximizing the smoothed max-component log-likelihood L_hat_sigma from eq. (5). At the same time, we enforce the constraints on the component weights and covariances as described in section 2.1, and transition from L_hat_sigma into L_hat by annealing (see section 2.4). SGD requires a learning rate epsilon to be set, which in turn determines the parameter alpha (see section 2.4) as alpha = epsilon, since stationarity detection should operate on a time scale similar to that of SGD. The diagonal matrices D_k are initialized to D_max I and are clipped after each iteration so that the diagonal entries of P_k = D_k^2 remain in the range [0, D_max^2]. This is necessary to avoid excessive growth of precisions for data entries with vanishing variance, e.g., pixels that are always black. Weights are initialized uniformly to pi_i = 1/K, and centroids uniformly in the range [-mu_i, +mu_i] (see algorithm 1 for a summary). Please note that our SGD approach requires no centroid initialization by k-means, as is recommended when training GMMs with (s)EM. We discuss and summarize good practices for choosing hyper-parameters in section 5.
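The initialization and post-step clipping just described can be sketched as follows (function names are illustrative; the actual released code is in TensorFlow):

```python
import numpy as np

def init_gmm(K, dim, mu_i=0.1, d_max=20.0, seed=0):
    """Uniform weights, small random centroids, precisions at the cut-off."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                        # pi_i = 1/K
    mu = rng.uniform(-mu_i, mu_i, size=(K, dim))    # centroids in [-mu_i, +mu_i]
    D = np.full((K, dim), d_max)                    # D_k = D_max * I
    return pi, mu, D

def clip_after_step(D, d_max=20.0):
    """Keep entries of D_k in [0, d_max], i.e. entries of P_k = D_k^2 in
    [0, d_max^2]; prevents precisions exploding on near-constant dimensions
    (e.g. always-black pixels)."""
    return np.clip(D, 0.0, d_max)

pi, mu, D = init_gmm(K=64, dim=784)
D = clip_after_step(D + 5.0)   # a step pushing precisions past the cut-off is undone
```

Note that no data statistics enter the initialization at any point, which is what makes the scheme usable on streams.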

Proof that Annealing is Convergent
We assume that, for a fixed value of sigma, SGD optimization has reached a stationary point where the derivative w.r.t. all GMM parameters is 0 on average. In this situation, we claim that decreasing sigma will always increase the loss; if true, this shows that sigma defines an effective upper bound on the loss. For consistency, we must also show that the loss gradient w.r.t. sigma vanishes as sigma -> 0: as the annealed loss approaches the original one, decreases of sigma have less and less effect.
Proposition. The gradient ∂L_hat_sigma/∂sigma is strictly negative for sigma > 0.

Proof. For each sample, the 2D profile of log(pi_k N_k) ≡ f_k is assumed to be centered on the best-matching component k* and to depend only on the distance r ≡ ||k - k*||, so that f_k = f_k(r). Passing to the continuous domain, the indices in the Gaussian smoothing filter g_k*k become continuous variables, g_k*k -> g(||k - k*||, sigma) ≡ g(r, sigma), and similarly f_k(r) -> f(r). Using 2D polar coordinates, the smoothed max-component likelihood becomes a polar integral around the position of the best-matching component:

L_hat_sigma ∝ ∫_0^∞ f(r) g(r, sigma) 2πr dr.

For the special case of a constant log-probability profile f(r) = L, L_hat_sigma does not depend on sigma because the Gaussians are normalized, and the derivative w.r.t. sigma vanishes:

∂L_hat_sigma/∂sigma ∝ L ∫_0^∞ (∂g/∂sigma) 2πr dr = L (N + P) = 0,

where we have split the integral into the part where the derivative w.r.t. sigma is negative (N) and the part where it is positive (P). We know that N + P = 0, since the derivative must be zero for a constant profile f(r) = L, the Gaussians being normalized to the same value regardless of sigma.

For a profile f(r) that satisfies f(r) > L for all r in [0, r_0[ and f(r) < L for all r in ]r_0, ∞[, where r_0 marks the sign change of ∂g/∂sigma, the inner and outer parts of the integral obey

∫_0^{r_0} f(r) (∂g/∂sigma) 2πr dr < L N,   ∫_{r_0}^∞ f(r) (∂g/∂sigma) 2πr dr < L P,

since f(r) is minorized/majorized by L by assumption and the contributions have the same sign over each domain of integration. Summing both parts shows that ∂L_hat_sigma/∂sigma < L (N + P) = 0 for sigma > 0; furthermore, this derivative is zero for sigma = 0, because L_hat_sigma no longer depends on sigma in that case.
Taking everything into consideration, in a situation where the log-likelihood L_hat_sigma has reached a stationary point for a given value of sigma, we have shown that: the value of L_hat_sigma depends on sigma; without changing the log-probabilities, we can increase L_hat_sigma by reducing sigma, assuming that the log-probabilities are mildly decreasing around the BMU; and increasing L_hat_sigma in this way works as long as sigma > 0, whereas at sigma = 0 the derivative becomes 0.
Thus, sigma indeed defines an upper bound on L_hat_sigma which can be raised by decreasing sigma. The assumption of log-probabilities that decrease, on average, around the BMU is reasonable, since such a profile maximizes L_hat_sigma. All functions f(r) that, e.g., decrease monotonically around the BMU fulfill this criterion, whereas the precise form of the decrease is irrelevant.

Comparing SGD and sEM
Since sEM optimizes the log-likelihood L, whereas SGD optimizes the annealed approximation L_hat_sigma, a comparison of these measures must be performed carefully. We claim that the comparison is fair assuming that i) SGD annealing has converged and ii) GMM responsibilities are sharply peaked, so that a single component has a responsibility of ≈ 1. It follows from i) that L_hat_sigma ≈ L_hat, and ii) implies that L_hat ≈ L. Condition ii) is usually satisfied to high precision, especially for high-dimensional inputs; if it is not, the comparison is biased in favor of sEM, since L > L_hat by definition.
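Condition ii) can be checked numerically; with hypothetical log-domain scores whose best/second-best gap is large (as is typical in high dimensions), logsumexp and the max-component value coincide to machine precision:

```python
import numpy as np

# Illustrative log(pi_k N_k) scores for one sample; the gap between the
# best and second-best component grows with data dimensionality.
scores = np.array([-1000.0, -1080.0, -1120.0])

L_full = scores.max() + np.log(np.exp(scores - scores.max()).sum())  # logsumexp
L_max = scores.max()                                                 # max-component
gamma = np.exp(scores - L_full)                                      # responsibilities
```

Here the BMU responsibility is 1 up to rounding, so L and L_hat are numerically indistinguishable, which is what makes the comparison to sEM fair.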

Experiments
Unless stated otherwise, the experiments in this section are conducted with the following parameter values for sEM and SGD (where applicable): mini-batch size B = 1, K = 8 x 8, mu_i = 0.1, sigma_0 = 2, sigma_∞ = 0.01, epsilon = 0.001, D_max = 20. Each experiment is repeated 10 times with identical parameters but different random seeds for parameter initialization. See section 5 for a justification of these choices. Due to input dimensionality, all precision matrices are assumed to be diagonal. The training/test data comes from the datasets shown below (see section 3.1).

Datasets
We use a variety of image datasets, as well as one non-image dataset, for evaluation purposes. All datasets are normalized to the [0, 1] range. MNIST [17] contains gray-scale images depicting handwritten digits from 0 to 9 at a resolution of 28x28 pixels; it is a common benchmark for computer vision and classification systems. SVHN [31] contains color images of house numbers (0-9, resolution 32x32).
FashionMNIST [32] contains grayscale images of 10 clothing categories and is considered a more challenging classification task than MNIST. Fruits 360 [19] consists of color pictures (100x100x3 pixels) showing different types of fruits, of which the ten best-represented classes are selected. Devanagari [1] includes grayscale images of handwritten Devanagari letters at a resolution of 32x32 pixels, of which the first 10 classes are selected. NotMNIST [33] is a grayscale image dataset (resolution 28x28 pixels) of letters from A to J extracted from different publicly available fonts. ISOLET [5] is a non-image dataset containing 7,797 samples of spoken letters recorded from 150 subjects; each sample is encoded as 617 float values.

Robustness of SGD to Initial Conditions
Here, we train GMMs for three epochs on classes 1 to 9 of each dataset. We use different random and non-random initializations of the centroids and compare the final log-likelihood values. Random centroid initializations are parameterized by mu_i in {0.1, 0.3, 0.5}, whereas non-random initializations take the centroids from a previous training run on class 0 (one epoch). The latter provides a non-random centroid initialization that is as dissimilar as possible from the training data. The initialization of the precisions cannot be varied, because empirical data show that training converges to undesirable solutions if the precisions are not initialized to large values. While this will have to be investigated further, we find convergence to near-identical levels regardless of centroid initialization for all datasets (see section 3.3 for more details).

Added Value of Annealing
To demonstrate the beneficial effects of annealing, we perform experiments on all datasets with annealing turned off, which is achieved by setting sigma_0 = sigma_∞. This invariably produces sparse-component solutions with strongly inferior log-likelihoods after training; please refer to section 3.3.

Table 1: Effect of different random and non-random centroid initializations on SGD training. Given are the means and standard deviations of final log-likelihoods (10 repetitions per experiment). To show the added value of annealing, the right-most column indicates the final log-likelihoods when annealing is turned off; this value should be compared to the left-most entry in each row, where annealing is turned on. Standard deviations in this case were very small and are therefore omitted.


Clustering Performance Evaluation
To compare the clustering performance of sEM and SGD, the Davies-Bouldin score [6] and the Dunn index [9] are determined. We evaluate grid-search results to find the best parameter setup for each metric. sEM is initialized by k-means, to show that our approach does not depend on data-driven parameter initialization. Section 3.4 indicates that SGD can match sEM performance.

Streaming Experiments with Constant Statistics
We train GMMs for three epochs (enough for convergence in all cases) using SGD and sEM on all datasets, as described in sections 2.5 and 2.7. The resulting centroids of our SGD-based approach are shown in fig. 3, whereas the final loss values for SGD and sEM are compared in section 3.5. The centroids of both approaches are visually similar, except for the topological organization induced by annealing for SGD, and the fact that in most experiments some components do not converge for sEM while the others do. Section 3.5 indicates that SGD achieves performance superior to sEM in the majority of cases, in particular for the highest-dimensional datasets (SVHN: 3,072 and Fruits 360: 30,000 dimensions).

Table 2: Clustering performance comparison of SGD and sEM training using the Davies-Bouldin score (less is better) and the Dunn index (more is better). Shown are the mean metric values (over 10 experiment repetitions) at the end of training, together with their standard deviations. Results are in bold face whenever they are better by more than half a standard deviation.

Table 3: Comparison of SGD and sEM training on all datasets in a streaming-data scenario. Shown are log-likelihoods at the end of training, averaged over 10 repetitions, along with their standard deviations. Results are in bold face whenever they are higher by more than half a standard deviation. Additionally, the averaged maximum responsibilities (p_k*) on test data are given to justify the max-component approximation.

Visualization of High-Dimensional sEM Outcomes
Section 3.5 was obtained after training GMMs by sEM on the Fruits 360 and SVHN datasets. It should be compared to fig. 3, where an identical procedure was used to visualize the centroids of SGD-trained GMMs. Notably, the effect of unconverged components does not occur at all for our SGD approach, owing to the annealing mechanism that "drags" unconverged components along.

Assumptions made by EM and SGD
The EM algorithm assumes that the observed data samples {x_n} depend on unobserved latent variables z_n in a non-trivial fashion. This assumption is formalized for a GMM with K components by formulating the complete-data likelihood, in which z_n in {0, ..., K-1} is a scalar:

p(x_n, z_n) = pi_{z_n} N_{z_n}(x_n), (10)

where we have defined N_k(x_n) ≡ N(x_n; mu_k, P_k) for brevity. It is assumed that the z_n are unobservable random variables whose distribution is unknown. Marginalizing them out gives the incomplete-data likelihood p(x_n) = sum_{z_n} p(x_n, z_n). The derivation of the EM algorithm starts from the total incomplete-data log-likelihood

L = sum_n log p(x_n) = sum_n log sum_{z_n} p(x_n, z_n).

Due to the assumption that L is obtained by marginalizing out the latent variables, an explicit dependency on z_n can be re-introduced, and Jensen's inequality can be used to construct a lower bound:

L = sum_n log sum_{z_n} p(z_n) [p(x_n, z_n) / p(z_n)] >= sum_n sum_{z_n} p(z_n) log [p(x_n, z_n) / p(z_n)]. (12)

Since the realizations of the latent variables are unknown, we may assume any form for their distribution. In particular, for the choice p(z_n) ~ p(x_n, z_n), the lower bound becomes tight. Simple algebra and the fact that the distribution p(z_n) must be normalized give us

p(z_n = k | x_n) = p(x_n, z_n = k) / sum_j p(x_n, z_n = j) = pi_k N_k(x_n) / sum_j pi_j N_j(x_n),

where we have used eq. (10) in the last step. p(z_n = k|x_n) is a quantity that can be computed from data with no reference to the latent variables; for GMMs it is usually termed the responsibility, and we write it as p(z_n = k|x_n) ≡ gamma_nk. However, the construction of a tight lower bound that is actually different from L only works when p(x_n, z_n) depends non-trivially on the latent variable z_n. If this is not the case, we have p(x_n, z_n) = K^-1 p(x_n), and the derivation of eq. (12) goes down very differently:

L >= sum_n sum_{z_n} p(z_n) log [K^-1 p(x_n) / p(z_n)] = sum_n (log p(x_n) - log K + H[p(z_n)]),

where H represents the Shannon entropy of p(z_n). The highest value H can attain is log K, for a uniform distribution of the z_n, finally leading to a lower bound for L of

L >= sum_n log p(x_n), (15)

which holds trivially by Jensen's inequality but is not tight.
In particular, no closed-form solutions of the associated extremal-value problem can be computed. This shows that optimizing GMMs by EM assumes that each sample has been drawn from a single element of a set of K uni-modal Gaussian distributions, where the distribution selected for sampling depends on a latent random variable. Optimization by SGD, on the other hand, uses the incomplete-data log-likelihood L directly, without assuming the existence of hidden variables at all. This may be advantageous for problems where the assumption of Gaussianity is badly violated, although empirical studies indicate that optimization by EM works very well in a wide range of scenarios.

Discussion and Conclusion
The relevance of this article is underlined by the fact that training GMMs by SGD was recently investigated in the community [13,14]. We go beyond this work, since our approach does not rely on offline data-driven model initialization and works for high-dimensional streaming data. The presented SGD scheme is simple and very robust to initial conditions due to the proposed annealing procedure, see sections 3.2 and 3.3. In addition, our SGD approach compares favorably to the reference model for online EM [2] in terms of achieved log-likelihoods, which was verified on multiple real-world datasets; superior SGD performance is observed for the high-dimensional datasets.
Analysis of the results suggests that SGD performs better than sEM on average, see section 3.5, although the differences are very modest. It should be stated clearly that it cannot be expected, and is not the goal of this article, to outperform sEM by SGD in the general case, but only to achieve similar performance. However, if sEM is used without, e.g., k-means initialization, components may not converge (see section 3.5) for very high-dimensional data like the Fruits 360 and SVHN datasets, which is why SGD outperforms sEM in this case. Another important advantage of SGD over sEM is that the only parameter that needs to be tuned is the learning rate epsilon, whereas sEM has a complex and unintuitive dependency on rho_0, rho_∞ and alpha_0.
Small batch sizes and streaming data are possible with the SGD-based approach. Throughout the experiments, we used a batch size of 1, which allows streaming-data processing without the need to store any samples at all. Larger batch sizes are possible and strongly increase execution speed. In the conducted experiments, SGD (and sEM) usually converged within the first two epochs, which is a substantial advantage whenever huge sets of data have to be processed.
No assumptions about data generation are made by SGD, in contrast to the EM and sEM algorithms. The latter guarantee that the loss will not decrease due to an M-step; this, however, assumes a non-trivial dependency of the data on an unobservable latent variable (shown in section 4). In contrast, SGD makes no hard-to-verify assumptions. This is a rather philosophical point, but it may be an advantage in situations where data are strongly non-Gaussian.
Numerical stability is assured by our SGD training approach. It does not optimize the log-likelihood but its max-component approximation. This approximation contains no exponentials at all and is well justified by the results of section 3.5, which show that component probabilities are strongly peaked. In fact, it is in the gradient computations that numerical problems such as NaN values occurred. The "logsumexp" trick mitigates the problem but does not eliminate it (see section 2.2), and it cannot be used when gradients are computed automatically, which is what most machine learning frameworks do.
Hyper-parameter selection guidelines are as follows: the learning rate epsilon must be set by cross-validation (a good value is 0.001). We empirically found that initializing precisions to the cut-off value D_max and initializing the pi_i uniformly are beneficial, and that centroids are best initialized to small random values. A value of D_max = 20 worked in all our experiments. Generally, the cut-off must be much larger than the inverse of the data variance; in many cases it should be possible to estimate this roughly, even in streaming settings, especially when samples are normalized. For density estimation, choosing higher values of K leads to higher final log-likelihoods. For clustering, K should be selected using standard techniques for GMMs. The parameter delta controls loss-stationarity detection for the annealing procedure and was shown to perform well for delta = 0.05.
Larger values lead to a faster decrease of sigma(t), which may impair convergence, whereas smaller values are always admissible but lead to longer convergence times. The annealing time constant alpha should be set to the GMM learning rate epsilon or lower; smaller values of alpha lead to longer convergence times, since sigma(t) is updated less often. The initial value sigma_0 must be large in order to enforce convergence of all components; a typical value is sqrt(K). The lower bound sigma_∞ should be as small as possible in order to achieve high log-likelihoods (e.g., 0.01; see section 2.6 for a proof).

Future Work
The presented work can be extended in several ways. First of all, annealing control could be simplified further by inferring good delta values from alpha. Likewise, increases of sigma might be performed automatically when the loss rises sharply, indicating a task boundary. Since we found that GMM convergence times grow linearly with the number of components, we will investigate hierarchical GMM models that operate like a Convolutional Neural Network (CNN), in which each individual GMM only sees a local patch of the input and can therefore have a low K.