Unsupervised Learning Through Generalized Mixture Model


A generalized way of building mixture models from different distributions is explored in this article. The EM algorithm is used with some modifications so that different distributions can be accommodated within the same model. The model uses any available point estimate of the respective distributions to estimate the mixture components and model parameters. The study focuses on the application of mixture models to unsupervised learning problems, especially cluster analysis. The convenience of building mixture models with the generalized approach is further emphasised by appropriate examples, exploiting the well-known maximum likelihood and Bayesian estimates of the parameters of the parent distributions.


Introduction
In the 21st century, Machine Learning has become a very important tool to analyse data. For labeled data, many supervised learning and deep learning techniques have been developed, and those techniques have provided promising results [11]. But obtaining labeled data is often troublesome, as reliable labeling requires a lot of resources. Unsupervised learning, on the other hand, can operate on unlabeled data and learn complex structures without the need for human participation in the process. Three examples are clustering, density estimation, and outlier and anomaly detection [12]. Typically, features (e.g. age, gender, income, etc.) are denoted as a vector of random variables $(X_1, \ldots, X_p)$, where $p$ is the dimension of the feature vector. In regular tabular data, $N$ is the number of independent realizations of this feature vector (e.g. observations on $N$ individuals). But the feature vector can be much more complex, as in word embeddings for natural language processing [1] or as pixels in red-green-blue channels at a certain resolution in image analysis [24].
Over the years, many unsupervised learning techniques, especially clustering methods, have been proposed (see [7], [18], [13], [19]), but in practice the K-Means algorithm and Gaussian Mixture Models are predominantly used. Even though detailed studies of all popular machine learning models have been carried out by many authors (e.g. [22], [3]), there is always a need for more generalized and convenient models, as no unsupervised technique seems to work best on all types of data. A famous example is the multivariate Gaussian Mixture Model (GMM), where one assumes that the multivariate data is generated by a mixture of $K$ multivariate $p$-dimensional normal distributions [20]. The number of mixture components is typically a hyperparameter, fixed through some ad hoc procedure or by using an information criterion (AIC, BIC). But the GMM is not robust against non-normality and heavy tails in the data. Hence, for different data structures, such as asymmetric and skewed data, mixtures of other distributions or mixtures of different distributions become necessary.
For estimating the parameters of mixture models, the EM algorithm [6], [2], [20] is widely used. In many applications of mixture models, e.g. in image matching [16] or audio and video scene analysis [8], the EM algorithm is used regularly. But the EM algorithm is often not very convenient to apply to distributions other than the normal (whether all components share one family or not), because it needs to be modified and adapted for each case. Sometimes, updating the parameters in the M step even becomes impossible for certain distributions [5].
In this article, we introduce a generalized EM algorithm which can accommodate any mixture of distributions without changing its core structure. All that is needed are point estimates of the parameters of the parent distributions. The approach is very convenient, as it works with both likelihood-based and Bayesian estimates. In the following sections, we discuss the methodology and demonstrate it with some simple examples.

Methodology
For clustered data, it can be assumed that it comes from a mixture of $k$ different distributions. Let us assume that $X_1, X_2, \ldots, X_k$ are $k$ random vectors (assume further that each random vector consists of $p$ random variables), following $k$ different distributions with probability density functions $f(x_j|\alpha_j)$, $j = 1, 2, \ldots, k$. Here, $\alpha_j$, $j = 1, 2, \ldots, k$, are vectors of component specific parameters for each density. Then $\alpha = (\alpha_1, \ldots, \alpha_k)$ denotes the vector of all parameters of the model. The sample size is denoted as $N$. The mixture model density for one observation $x$ is given by

$$f(x|\pi, \alpha) = \sum_{j=1}^{k} \pi_j f(x|\alpha_j), \tag{1}$$

where $\pi = (\pi_1, \ldots, \pi_k)$ contains the corresponding mixture proportions. Note that this notation does not really reflect the fact that one specific observed $p$-dimensional data point is assumed to be a random sample from one of the $k$ components. The log likelihood of the model for a sample of size $N$ is then given by

$$\ell(\pi, \alpha) = \sum_{i=1}^{N} \log\left(\sum_{j=1}^{k} \pi_j f(x_i|\alpha_j)\right). \tag{2}$$

The parameters can be estimated using the EM algorithm with some modifications. For that purpose, latent variables $z_{ij}$ are introduced, denoting whether observation $i$ comes from mixture component $j$ (then $z_{ij} = 1$) or not (then $z_{ij} = 0$). Alternatively, we can write $z_i = c$ with $c \in \{1, 2, \ldots, k\}$. Further, probabilities $\gamma_{ij}$ are introduced:

$$\gamma_{ij} = P(z_{ij} = 1 \mid x_i) = \frac{\pi_j f(x_i|\alpha_j)}{\sum_{l=1}^{k} \pi_l f(x_i|\alpha_l)}. \tag{3}$$

For an EM algorithm we try to optimize the function

$$Q(\theta|\theta^{(t)}) = E_{Z|X,\theta^{(t)}}\left[\log L(\theta; X, Z)\right],$$

where $t$ is the current iteration number and $\theta = (\pi, \alpha)$. It can also be shown that

$$Q(\theta|\theta^{(t)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij}\left(\log \pi_j + \log f(x_i|\alpha_j)\right).$$

At the M step, we optimize $Q$ with respect to $\pi$ and $\alpha$. $\pi_j$ is estimated in the usual way by $\frac{N_j}{N}$, where $N_j = \sum_{i=1}^{N} \gamma_{ij}$, and for estimating $\alpha$, we look at the part of $Q$ which depends on $\alpha$, given by

$$l(\alpha_j) = \sum_{i=1}^{N} \gamma_{ij} \log f(x_i|\alpha_j).$$

Now, we choose $\alpha_j$ such that $\alpha_j^{(t)} = \operatorname{argmax}_{\alpha_j} l(\alpha_j)$, which is obtained by assigning data points to their respective clusters, where $z_i = \operatorname{argmax}_j \gamma_{ij}$, and estimating $\alpha_j$ by some estimation method based on the observations assigned to that cluster. This can be seen as a Bayesian concept (although not strictly Bayesian) for learning, and equation (3) can be seen as $p(z = c|x, \alpha)$, the cluster membership probability. Choosing the cluster with maximum probability is the same as choosing the MAP estimate, the mode of the distribution $p(z = c|x, \alpha)$.
To run the algorithm, some trial values of the distribution parameters $\alpha$ and mixture proportions $\pi$ are initialized first, and the initial value of the log likelihood is evaluated. For different distributions, different techniques can be used to choose suitable initial values. For example, in the case of a GMM, the centroids found by K-Means can be used as initial values of $\mu_j$, and the empirical covariance matrix of each cluster can be taken as the initial value of $\Sigma_j$. Initial values of $\pi$ can be obtained by generating a random vector from a Dirichlet$(1, 1, \ldots, 1)$ distribution.
At the E step, the values of the probabilities $\gamma_{ij}$ are evaluated using the current parameter values. For a usual EM algorithm (e.g. in a GMM), a weighted mean and a weighted covariance matrix are calculated at the M step using the $\gamma_{ij}$ values. But for other distributions, where the model parameters are not the mean and (co)variance, this technique cannot be used, so different techniques need to be employed for different distributions. To provide a flexible yet convenient solution, we propose a different technique in our generalized EM algorithm: at the M step, each data point is assigned to the cluster for which its membership probability is maximum. After all data points have been assigned, point estimates of the parameters of each parent distribution are obtained using only the data points in the corresponding cluster. For faster convergence and convenience, maximum likelihood estimates can usually be recommended. The mixture proportions $\pi_j$ are estimated, as mentioned above, by $\frac{N_j}{N}$. The newly estimated parameter values then replace the previous ones. After this step, the log likelihood is evaluated again using the updated parameter values, and the process is continued until convergence. The convergence properties of this algorithm follow those of the usual EM algorithm, which have been explained in detail by [19].
For our experiments, we have used 0.0001 as the value of the convergence threshold $\epsilon$ in Algorithm 1.
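To make the procedure concrete, the following is a minimal Python sketch of the generalized EM loop with the hard-assignment M step described above. It assumes each mixture component is represented by an object exposing a `pdf(X)` method (component density evaluated at every row of `X`) and a `fit(X)` method (a point estimate, e.g. the MLE, computed from the rows assigned to that cluster); these method names are illustrative, not from the paper, and the components are assumed to arrive with initial parameter values (e.g. from K-Means, as described above).

```python
import numpy as np

def generalized_em(X, components, max_iter=100, tol=1e-4, seed=None):
    """Generalized EM with a hard-assignment M step (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    N, k = X.shape[0], len(components)
    pi = rng.dirichlet(np.ones(k))                 # initial mixture proportions
    log_lik_old = -np.inf
    for _ in range(max_iter):
        # E step: gamma_ij proportional to pi_j * f(x_i | alpha_j), equation (3)
        dens = np.column_stack([c.pdf(X) for c in components])   # N x k
        weighted = pi * dens
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # Hard assignment: z_i = argmax_j gamma_ij
        z = gamma.argmax(axis=1)
        # M step: refit each component on its assigned data points
        for j, c in enumerate(components):
            if np.any(z == j):                     # Algorithm 1 also guards empty clusters
                c.fit(X[z == j])
        pi = gamma.sum(axis=0) / N                 # pi_j = N_j / N
        # Evaluate the log likelihood, equation (2), with updated parameters
        dens = np.column_stack([c.pdf(X) for c in components])
        log_lik = np.log((pi * dens).sum(axis=1)).sum()
        if abs(log_lik - log_lik_old) < tol:
            break
        log_lik_old = log_lik
    return z, pi, log_lik
```

The component classes in the examples below plug into this loop unchanged, which is precisely the flexibility the generalized approach is meant to provide.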

Examples
The generalized approach explained in the above section has the advantage that it can be used for mixtures of different distributions if the MLE or any other point estimate is known for each component density. The process is further simplified if the mixture densities belong to one family of distributions (e.g. all densities are normal or all probability mass functions are Poisson). In any case, the approach can be used without any modification.

Algorithm 1: Generalized EM Algorithm for a Mixture of Different Distributions
Initialize the model parameters $\alpha$ and $\pi$; evaluate the initial value of the log likelihood from equation (2);
while log likelihood difference $\geq \epsilon$ do
    Evaluate $\gamma_{ij}$ from equation (3), using the current parameter values and the data;
    Assign data point $i$ to cluster $z_i = \operatorname{argmax}_j \gamma_{ij}$;
    for $j$ in 1 to $k$ do
        if cluster $j$ is empty then ...
        Estimate $\alpha_j$ from the data points assigned to cluster $j$ and set $\pi_j = N_j/N$;
    end
    Evaluate the log likelihood with the updated parameter values;
end

Let us now look at some simple examples where we build mixture models using common distributions. We have used an accuracy measure to check the efficiency of our models: the ratio of the number of correct cluster assignments to the total number of observations.
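A small sketch of this accuracy measure follows. Since cluster labels produced by the algorithm are arbitrary, the sketch maximizes over relabelings of the predicted clusters before counting correct assignments; this matching step is our assumption, as the text does not spell out how labels are aligned.

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(z_true, z_pred, k):
    """Ratio of correct assignments to total observations, maximized over
    relabelings of the predicted clusters (cluster labels are arbitrary)."""
    z_true, z_pred = np.asarray(z_true), np.asarray(z_pred)
    return max((np.take(perm, z_pred) == z_true).mean()
               for perm in permutations(range(k)))
```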

Mixture of an Exponential and a Normal Distribution (EGMM)
To demonstrate the usefulness of the generalized mixture model, we have considered datasets which are a mixture of a highly asymmetric and a bell-shaped symmetric distribution. We have modelled such data by a mixture of an exponential and a normal distribution. The pdf of an exponential distribution with parameter $\lambda > 0$ is given by

$$f(x|\lambda) = \lambda e^{-\lambda x}, \quad x \geq 0,$$

and the MLE of $\lambda$ is given by

$$\hat{\lambda} = \frac{1}{\bar{x}} = \frac{n}{\sum_{i=1}^{n} x_i}.$$

The pdf of a normal distribution is given by

$$f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

and the MLEs of $\mu$ and $\sigma^2$ are given by

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$

The mixture model can be built using equation (1), and the parameters can be estimated by using these MLEs at the M step of Algorithm 1.
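As an illustration, these MLEs plug directly into the `fit` methods of the component objects used by the `generalized_em` sketch above. The following is a minimal sketch assuming SciPy is available and the rate parameterization of the exponential distribution (the parameterization used in the experiments is an assumption on our part); the class names are illustrative.

```python
import numpy as np
from scipy import stats

class ExponentialComponent:
    """Exponential component; fit() uses the MLE lambda-hat = 1 / x-bar."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def pdf(self, X):
        return stats.expon.pdf(X.ravel(), scale=1.0 / self.lam)
    def fit(self, X):
        self.lam = 1.0 / X.mean()

class NormalComponent:
    """Univariate normal component; fit() uses the MLEs of mu and sigma^2."""
    def __init__(self, mu=0.0, sigma=1.0):
        self.mu, self.sigma = mu, sigma
    def pdf(self, X):
        return stats.norm.pdf(X.ravel(), loc=self.mu, scale=self.sigma)
    def fit(self, X):
        self.mu, self.sigma = X.mean(), X.std()   # X.std() is the 1/n MLE

# Data resembling the second experiment below (rate parameterization assumed)
rng = np.random.default_rng(0)
X = np.concatenate([rng.exponential(scale=1/2, size=1000),
                    rng.normal(8, 2.5, size=1300)]).reshape(-1, 1)
z, pi, ll = generalized_em(X, [ExponentialComponent(), NormalComponent()])
```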
We have done two simulation studies and compared the accuracy with the usual GMM over 100 simulations. For the first experiment, at each step, we have generated 1000 random samples from an exponential distribution with parameter value 15 and 1300 random samples from a normal distribution with mean 50 and standard deviation 4. For the second experiment, at each step, we have drawn 1000 random samples from an exponential distribution with parameter value 2 and 1300 random samples from a normal distribution with mean 8 and standard deviation 2.5. The empirical distributions of the data of one such simulation of each of the two cases are shown in Figure 1 and Figure 6. From the results, it is seen that a mixture of an exponential and a Gaussian distribution works much better than a GMM for our chosen data set. This confirms our idea that for different symmetric and asymmetric structures of the data, mixtures of appropriate distributions should be used instead of using a GMM in every case.

Mixture of Dirichlet Distributions (DMM)
If $(X_1, X_2, \ldots, X_p)$ follows a Dirichlet distribution with parameters $\alpha_1, \ldots, \alpha_p > 0$, the density is given by

$$f(x_1, \ldots, x_p|\alpha_1, \ldots, \alpha_p) = \frac{\Gamma\left(\sum_{j=1}^{p} \alpha_j\right)}{\prod_{j=1}^{p} \Gamma(\alpha_j)} \prod_{j=1}^{p} x_j^{\alpha_j - 1}, \quad x_j \geq 0, \; \sum_{j=1}^{p} x_j = 1.$$

If we build a finite mixture with $k$ components, the model is given by equation (1) and, subsequently, the log likelihood is given by equation (2).
The model parameters can be easily estimated using the generalized approach. For that, we need a good point estimate of the parameters of a Dirichlet distribution to be used in the M step of the generalized EM algorithm. [21] discusses a way to find the maximum likelihood estimates of the parameters of a Dirichlet distribution via a fixed point iteration, given initial values of the $\alpha$ parameters. The iteration is given by

$$\Psi(\alpha_j^{new}) = \Psi\left(\sum_{l=1}^{p} \alpha_l^{old}\right) + \frac{1}{N} \sum_{i=1}^{N} \log x_{ij},$$

where $\Psi$ is the digamma function. The iteration requires inverting $\Psi$; a suitable inversion algorithm is also discussed by [21]. We have done a simulation study to check the efficiency of the proposed technique. For a DMM, Algorithm 1 can be used without any alteration. At first, we have drawn 500 random samples from Dirichlet(30,20,10), 100 random samples from Dirichlet(10,20,30) and 300 random samples from Dirichlet(15,15,15). Then we have run the clustering algorithm 100 times on the same data to check its consistency. Figure 9 and Figure 10 show the corresponding results. After that, we have simulated data from the same distributions 100 times and applied the Dirichlet mixture model and K-Means each time to compare their performance. The results are shown in Figure 13 and Figure 14.
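A sketch of this fixed point iteration follows, with a Newton-based digamma inversion along the lines suggested in [21] (the initialization constants follow that reference; the function names are ours).

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, n_newton=5):
    """Invert the digamma function by Newton's method (initialization from [21])."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y + np.euler_gamma))
    for _ in range(n_newton):
        x -= (digamma(x) - y) / polygamma(1, x)   # trigamma in the denominator
    return x

def dirichlet_mle(X, max_iter=1000, tol=1e-7):
    """Fixed point iteration of [21] for the Dirichlet MLE.
    X is an N x p matrix of compositions (each row sums to 1)."""
    log_xbar = np.log(X).mean(axis=0)              # (1/N) sum_i log x_ij
    alpha = np.ones(X.shape[1])                    # simple initial value
    for _ in range(max_iter):
        alpha_new = inv_digamma(digamma(alpha.sum()) + log_xbar)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            break
        alpha = alpha_new
    return alpha_new
```

Wrapped in a component object with `pdf` and `fit` methods, this estimator plugs into `generalized_em` exactly as in the previous example.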
We have done another simulation study with a more difficult data structure. We have drawn 500 random samples from Dirichlet(10,10,3), 100 random samples from Dirichlet(10,20,50), 300 random samples from Dirichlet(15,15,15) and 400 random samples from Dirichlet(0.2,0.5,3), each 100 times. The data set generated in one such simulation is shown in Figure 8 with the true clusters. At each step we have compared the accuracy of the DMM and K-Means. We see that the algorithm consistently assigns the data points to the right cluster with very good accuracy. And, as expected, the DMM works better than K-Means for data drawn from a DMM model.

Mixture of Gaussian Distributions (GENGMM)
For a random vector $X = (X_1, X_2, \ldots, X_p)$, the density of the $p$-variate multivariate normal distribution is given by

$$f(x|\mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \tag{12}$$

where $\mu$ is a $p \times 1$ vector, $\Sigma$ is a $p \times p$ symmetric positive definite matrix and $x$ is a vector in $\mathbb{R}^p$. The maximum likelihood estimates of $\mu$ and $\Sigma$ are given by

$$\hat{\mu} = \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T.$$

The mixture model can be built the usual way and the model parameters can be estimated using these MLEs at the M step.
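A corresponding component object for the `generalized_em` sketch above might look as follows (a minimal sketch; the class name is ours):

```python
import numpy as np
from scipy import stats

class GaussianComponent:
    """p-variate Gaussian component; fit() uses the MLEs stated above."""
    def __init__(self, p):
        self.mu = np.zeros(p)
        self.Sigma = np.eye(p)
    def pdf(self, X):
        return stats.multivariate_normal.pdf(X, mean=self.mu, cov=self.Sigma)
    def fit(self, X):
        self.mu = X.mean(axis=0)
        diff = X - self.mu
        self.Sigma = diff.T @ diff / len(X)   # note the 1/N MLE, not 1/(N-1)
```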
We have chosen a synthetic data set for our simulation study from [10]. We have deliberately chosen this data for its unique and difficult cluster structure. The data with its true clusters is displayed in Figure 19. The corresponding results are shown in Figure 22 and Figure 23. From the results we see that, even though the clusterings produced by the two algorithms sometimes differ, the average accuracy is almost the same.
We have also done a simulation study, where we have drawn 200 random samples from a multivariate normal distribution with $\mu = (3, 3, 3)$ and a covariance matrix with each diagonal element 0.8, 250 random samples from a multivariate normal distribution with $\mu = (7, 5, 4)$ and a covariance matrix with each diagonal element 0.9, and 800 random samples from a multivariate normal distribution with $\mu = (3, 7, 15)$ and a covariance matrix with each diagonal element 3, each 100 times. The data of one such simulation with true clusters is shown in Figure 24. At each step we have run a GMM with our generalized approach as well as with the usual approach and compared the accuracy.
The corresponding results are shown in Figure 27 and Figure 28. From the results, it is clearly seen that the generalized approach is very consistent, with high accuracy compared to the usual approach.

Mixture Model with Bayesian Estimates (BGMM)
Prior knowledge about the parameters can also be incorporated in the mixture model using the generalized approach: instead of the MLE, Bayesian point estimates can be used in the M step of the algorithm. In our study we have used Variational Bayesian Methods for the Bayesian analysis (see [9], [14], [4]) of a multivariate normal distribution. In general, the multivariate normal density is given by equation (12). The prior distribution of $\mu$ is chosen as a multivariate normal distribution with mean $\mu_0$ and covariance matrix $\frac{1}{\lambda}\Sigma$, while the prior of $\Sigma$ follows an Inverse Wishart distribution with parameters $\Lambda$ and $\nu$:

$$(\mu, \Sigma) \sim \mathrm{NIW}(\mu_0, \lambda, \Lambda, \nu).$$

This is a conjugate prior, so the posterior distribution also follows a Normal Inverse Wishart distribution, with parameters

$$\mu_0' = \frac{\lambda \mu_0 + N\bar{x}}{\lambda + N}, \quad \lambda' = \lambda + N, \quad \nu' = \nu + N, \quad \Lambda' = \Lambda + S + \frac{\lambda N}{\lambda + N}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T,$$

where $S = \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$. For our study, we have taken $\mu_0$ as a vector of zeros of length $p$, $\lambda$ as 0.00001, $\Lambda$ as a $p \times p$ diagonal matrix with each diagonal element 0.0001, and $\nu$ as $p$.
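A sketch of the conjugate update and the resulting point estimates follows. Note that we use the exact conjugate NIW update here as a simple stand-in for the variational scheme of [9], [14], [4]; the function and class names are ours, and the hyperparameters are those stated above.

```python
import numpy as np

def niw_posterior(X, mu0, lam, Lambda, nu):
    """Conjugate Normal-Inverse-Wishart update; returns point estimates
    for (mu, Sigma): the posterior mean of mu and posterior mode of Sigma."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    lam_n, nu_n = lam + N, nu + N
    mu_n = (lam * mu0 + N * xbar) / lam_n
    d = (xbar - mu0).reshape(-1, 1)
    Lambda_n = Lambda + S + (lam * N / lam_n) * (d @ d.T)
    return mu_n, Lambda_n / (nu_n + p + 1)   # mode of IW(Lambda_n, nu_n)

class BayesianGaussianComponent(GaussianComponent):
    """Gaussian component whose fit() uses the NIW point estimates
    with the hyperparameters stated above instead of the MLEs."""
    def fit(self, X):
        p = X.shape[1]
        self.mu, self.Sigma = niw_posterior(
            X, mu0=np.zeros(p), lam=1e-5, Lambda=1e-4 * np.eye(p), nu=p)
```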
We have used the generalized EM algorithm with Bayesian estimates for image segmentation. In computer vision, RGB images from the Corel image database are widely used (see [17], [23]). We have taken four images from the database of 40000 images for simple image segmentation by colour, using two clusters for all images. The generalized EM algorithm with Bayesian estimates seems to work quite well for image segmentation. The results are as expected and shown in Figure 29.
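For instance, segmenting one image by colour could look like the following sketch, where each pixel's RGB values form one observation (the file name is illustrative; `generalized_em` and `BayesianGaussianComponent` are the sketches from above):

```python
import numpy as np
from PIL import Image

# Cluster the pixels of an RGB image into k = 2 colour clusters.
img = np.asarray(Image.open("corel_example.jpg"), dtype=float) / 255.0
pixels = img.reshape(-1, 3)                # one row of RGB values per pixel
z, pi, ll = generalized_em(pixels,
                           [BayesianGaussianComponent(3),
                            BayesianGaussianComponent(3)])
segment_mask = z.reshape(img.shape[:2])    # one cluster label per pixel
```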

Conclusion
We have shown a generalized way of building mixture models using a generalized EM algorithm. Mixtures of different symmetric and asymmetric distributions seem to work quite well. Sometimes, the generalized approach has revealed slightly different and interesting patterns in the data, but it always achieves accuracy comparable to the GMM, and often even better.
Mixture models of different distributions can also be built by using the MLE or other estimates of the respective distributions in the M step of the generalized EM algorithm. One could try, e.g., multivariate t, skew normal or Pearson type VII distributions [15]. Further models and discussions are left for future work. The algorithm has successfully used different Bayesian estimates to find patterns in the data. Even though the generalized EM algorithm converges in a small number of iterations (5-10), the computing time is sometimes high, especially with Bayesian methods. The algorithm remains unchanged for different distributions and different types of estimates, which makes it very flexible. For practical purposes, our over-reliance on the normal distribution should be challenged, and new generalized multivariate distributions of both symmetric and asymmetric nature should be developed. Using such distributions in mixture models can give better results.