Cluster Weighted Models Based on the TSNE Algorithm for High-Dimensional Data

Similar to many machine learning models, both the accuracy and the speed of cluster weighted models (CWMs) can be hampered by high-dimensional data; this has motivated previous work on parsimonious techniques to reduce the effect of the "curse of dimensionality" on mixture models. In this work, we review the background of cluster weighted models (CWMs). We further show that parsimonious techniques alone are not sufficient for mixture models to thrive in the presence of huge high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the default values in the "FlexCWM" R package. We introduce a dimensionality reduction technique called t-distributed stochastic neighbor embedding (TSNE) to enhance parsimonious CWMs in high-dimensional spaces. CWMs were originally suited for regression, so for classification purposes all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation-maximization (EM) algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.


Background History
Efficient dimension reduction is required to uncover the hidden patterns of information in real data such as engineering data. Dimension reduction can be used to convert data sets containing millions of features into a practicable space for effective processing and analysis. Unsupervised learning is the main technique for dimensionality reduction. Conventional dimensionality reduction approaches can be integrated with statistical analysis to improve performance on high-dimensional data [Rehman et al., 2016].
Many dimensionality reduction techniques, such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (TSNE), have been developed by the statistical and artificial intelligence communities. In recent years, there has been rising interest in PCA mixture models. Mixture models provide an important framework for modeling complex data with a weighted component distribution. Due to their high flexibility and efficiency, they are used widely in many fields such as machine learning, image processing, and data mining. However, implementations in high-dimensional spaces are restricted by practical considerations because the component distributions are formalized as probability density functions. PCA mixture models are a mixture-of-experts technique that models a nonlinear distribution through a combination of local linear sub-models with a fair sample distribution [Jin et al., 2004]. For more information on PCA mixture models, interested readers should consult [Kim et al., 2003], [Xu et al., 2014], and [Kutluk et al., 2016]. The main limitation, however, is the linearity of PCA. PCA creates linear combinations of the existing features, so it fails to capture nonlinearity among the features and cannot represent complex polynomial relationships between them. Therefore, PCA generally performs poorly when the relationships between the variables are nonlinear, which may be inevitably present in high-dimensional data.

CWMs-TSNE for High-dimensional data
Finite mixture models have been widely and successfully applied in fields such as biology, genetics, medicine, psychiatry, economics, engineering, marketing, and astronomy, among many others in the biological, physical, and social sciences. In these applications, finite mixture models underpin a variety of techniques in major areas of statistics, including latent class analysis, discriminant analysis, image analysis, and survival analysis, in addition to their more direct role in providing descriptive models for distributions in data analysis and inference.
However, most finite mixture models assume assignment independence, which implies that the probability for a point to be generated by one of the clusters must be the same for all covariate values x; in other words, the assignment of a data point to a cluster must be independent of the covariates [Hennig, 2000]. In many applications, though, the cluster membership is influenced by the covariate values. There are reasonable models for linear regression clusters that do not assume assignment independence. One strategy is to replace the fixed covariates by covariate distributions that are allowed to differ between the clusters; this mirrors the evolution of cluster weighted models, which assume varying covariates with a parameterized family of distributions. This solves the problem of assignment independence, i.e., the covariate distribution of each mixture component is unique to its cluster. In the framework of mixture models with varying covariates, the cluster weighted model [CWM; Gershenfeld, 1997] is given by

$$p(y, x) = \sum_{g=1}^{G} \pi_g\, p(y, x \mid D_g) = \sum_{g=1}^{G} \pi_g\, p(y \mid x, D_g)\, p(x \mid D_g), \qquad (1)$$

also called the saturated mixture regression model [Wedel, 2002], and constitutes a reference approach to modeling the joint density. In Equation (1), normality of both p(y | x, D_g) and p(x | D_g) is commonly assumed [Antonio, 2012; Gershenfeld, 1997]. In this paper, we focus mainly on the application of CWMs to high-dimensional data; the data considered here range from tens to hundreds of features. Like any machine learning classifier, the clustering performance of CWMs can be hindered by redundancies in the feature space of the data. Moreover, the computation speed degrades rapidly as the dimensionality increases. We hereby present CWMs in the presence of high-dimensional data, beginning with the interplay between t-distributed stochastic neighbor embedding and CWMs for clustering high-dimensional data.

T-distributed Stochastic Neighbor Embedding Technique
The stochastic neighbor embedding (SNE) was first introduced by [Hinton and Roweis, 2002]. SNE aims to place objects in a low-dimensional space so as to retain neighbor identity, and can be naturally extended to allow multiple different low-dimensional images of each object [Hinton and Roweis, 2002]. As a dimensionality reduction technique, SNE can produce reasonably good visualizations; however, it is hindered by a cost function that is difficult to optimize. [Maaten and Hinton, 2008] introduced a variation of SNE called TSNE. The aim of TSNE is to transform the high-dimensional data set X = (x_1, ..., x_n) into a low-dimensional data set Y = (y_1, ..., y_n). TSNE is much easier to optimize, and provides significantly better visualizations by reducing the tendency to crowd points together in the center of the map [Maaten and Hinton, 2008]. The cost function employed in TSNE differs from that of SNE: TSNE employs a symmetric version of SNE, which also mitigates the problem caused by the presence of outliers. In symmetric SNE, the pairwise similarities in the low-dimensional map are

$$q_{ij} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq l} \exp(-\lVert y_k - y_l \rVert^2)}, \qquad (2)$$

and the pairwise similarities in the high-dimensional space are defined by symmetrizing the conditional similarities,

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}. \qquad (3)$$

These similarities are referred to as symmetric because p_{ij} = p_{ji} and q_{ij} = q_{ji} for all i, j. Another distinctive feature of TSNE is that it applies a Student t-distribution with degrees of freedom v = 1 (equivalent to a Cauchy distribution) as the heavy-tailed distribution in the low-dimensional space, so the joint probabilities for the low-dimensional map instead become

$$q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}. \qquad (4)$$

The advantages of employing a Student t-distribution can be found in [Maaten and Hinton, 2008]. The ultimate goal of TSNE is to represent p_{ij} by q_{ij} as accurately as possible, so the cost function is the Kullback-Leibler divergence

$$C = KL(P \parallel Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \qquad (5)$$

Gradient descent is used to minimize the cost function, and the gradient has the form

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}. \qquad (6)$$

Equation (6) can be interpreted as the resultant of forces pulling y_i in the direction of y_j or pushing it away, depending on whether j is observed as a neighbor of i. The gradient descent is initialized by sampling the map points Y^{(0)} = (y_1, ..., y_n) from N(0, 10^{-4} I). A momentum term is added to the gradient descent to speed up the optimization and to avoid poor local optima. The gradient update is given by

$$Y^{(t)} = Y^{(t-1)} - \zeta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right), \qquad (7)$$

where Y^{(t)} is the solution at iteration t, ζ is the learning rate, and α(t) is the momentum at iteration t.
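The low-dimensional similarities, the KL cost, and its gradient described above can be sketched in a few lines of numpy. This is an illustrative reimplementation of the formulas, not code from any t-SNE library; the function names are ours, and the momentum step mirrors the update rule stated in the text.

```python
import numpy as np

def tsne_cost_grad(Y, P):
    """KL cost and its gradient for a map Y given symmetric similarities P.

    Uses the Student-t (v = 1) low-dimensional similarities q_ij and the
    gradient 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2).
    """
    diff = Y[:, None, :] - Y[None, :, :]          # pairwise differences y_i - y_j
    W = 1.0 / (1.0 + np.square(diff).sum(-1))     # heavy-tailed Student-t kernel
    np.fill_diagonal(W, 0.0)
    Q = np.maximum(W / W.sum(), 1e-12)            # low-dimensional similarities
    mask = P > 0
    cost = np.sum(P[mask] * np.log(P[mask] / Q[mask]))   # KL(P || Q)
    grad = 4.0 * np.einsum('ij,ijk->ik', (P - Q) * W, diff)
    return cost, grad

def momentum_step(Y, Y_prev, grad, zeta=100.0, alpha=0.5):
    """One descent update with learning rate zeta and momentum alpha."""
    return Y - zeta * grad + alpha * (Y - Y_prev)
```

Starting from a map sampled from N(0, 10^-4 I), repeated calls to `momentum_step` drive the KL cost down; the analytic gradient can be verified against finite differences of the cost.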

General Formulation of CWMs
Let (X, Y) be a pair of a random vector X and a random variable Y defined on D with joint probability density p(x, y), where X is a d-dimensional input vector with values in some space X ⊆ R^d and Y is a response variable with values in Y ⊆ R. The set of all model parameters is denoted Θ = (ω, µ, Σ, π), where ω ∈ R^{d×G} denotes the weights of the local models, µ ∈ R^{d×G} the location parameters, G the number of groups, Σ the positive definite covariance matrices, and π the mixing distribution, subject to the constraints Σ_g π_g = 1 and π_g > 0.
Generally, CWMs are written as a sum

$$p(x, y) = \sum_{g=1}^{G} p_g(x, y),$$

where g enumerates the clusters and p_g(x, y) is a density of the specific form discussed below. The total number of clusters G must be chosen beforehand and can be selected based on information criteria. The density p_g(x, y) is written as

$$p_g(x, y) = p(y \mid x, z_g)\, p(x \mid z_g)\, \pi_g, \qquad (10)$$

where p(z_g) = π_g. The terms in Equation (10) have the following interpretation:

Cluster weights: The cluster weight π_g ∈ [0, 1] denotes the amount of data described by cluster g. The π_g are chosen subject to the constraint Σ_g π_g = 1.

Probability of inputs: The density p_g(x) = p(x | z_g) describes the domain of influence of cluster g, that is, the distribution of inputs x around the cluster. These densities are assumed Gaussian, i.e.

$$p(x \mid z_g) \sim N(\mu_g, \Sigma_g), \qquad (12)$$

with mean µ_g and covariance matrix Σ_g, effectively describing the location and the range of the cluster's influence.
When working in high-dimensional spaces, it is often convenient to reduce these input densities to separable Gaussians with a diagonal covariance matrix of individual variances in each dimension, i.e. Σ_g = diag(σ_{g,1}, ..., σ_{g,d}).
Output terms: The density p(y | x, z_g) is the conditional density of the output y given the input x around cluster g. The presence of this conditional distribution allows the input vector x to relate to the target variable y. In general, these densities are chosen as Gaussians,

$$p(y \mid x, z_g) \sim N\!\left(f(x, \beta_g), \sigma_g^2\right), \qquad (14)$$

whose mean f(x, β_g) and variance σ_g² describe the local model and the error around cluster g. The vector β_g denotes the coefficients of the local model, i.e., the weights of the contributions associated with the input vector x. The p_g(y | x) are normalized, so that

$$\int p_g(y \mid x)\, dy = 1 \quad \forall\, g, x. \qquad (16)$$

The cluster functions are chosen based on the type of supervised learning (regression or classification) we wish to do; they are mostly chosen as linear combinations of basis functions f_i(x).
The model output of the CWM is therefore a weighted average of the local functions f(x, β_g), where the input densities p_g(x) control the influence of each local function. The real problem is to find good parameter values for
• the weights π_g,
• the means µ_g and the covariance matrices Σ_g of the input densities,
• the variances σ_g² of the output terms,
• the parameters β_g of the local functions.
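To make the decomposition p_g(x, y) = p(y | x, z_g) p(x | z_g) π_g concrete, the joint density of a linear Gaussian CWM can be sketched directly in numpy. This is an illustration of the formulas above, not the FlexCWM implementation; all function and variable names are ours.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of a multivariate normal N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def cwm_density(x, y, pis, mus, Sigmas, betas, sigmas):
    """Joint density p(x, y) = sum_g pi_g * p(y | x, g) * p(x | g) of a linear
    Gaussian CWM; beta_g[0] is the intercept of the local linear model."""
    total = 0.0
    for pi_g, mu_g, Sig_g, beta_g, s_g in zip(pis, mus, Sigmas, betas, sigmas):
        mean_y = beta_g[0] + x @ beta_g[1:]                          # f(x, beta_g)
        p_y = np.exp(-0.5 * ((y - mean_y) / s_g) ** 2) / (s_g * np.sqrt(2 * np.pi))
        total += pi_g * p_y * mvn_pdf(x, mu_g, Sig_g)                # weight * output * input
    return total
```

Since each conditional p(y | x, g) integrates to one, integrating the joint density over y recovers the input mixture Σ_g π_g N(x | µ_g, Σ_g), which gives a quick sanity check of any implementation.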

EM algorithm applied to CWMs
In the case of CWMs, as described in Section (2.1), the likelihood becomes easier to handle by introducing a latent variable, which we can interpret as unobserved data. This unobserved random variable can be imagined as sampling each pair (x_i, y_i) from a single cluster with some probability. Let Z_i ∈ {1, ..., G} be the label of the cluster that gave rise to (x_i, y_i); this random variable is unobserved. The cluster weights π_g of Equation (10) are interpreted as the probability that Z_i = g for g = 1, ..., G, implying that the Z_i follow a multinomial distribution parameterized by the cluster weights π_g. Handling the cluster model parameters through maximum likelihood would be straightforward if the cluster that generates each sample were known a priori; for example, each cluster center would simply be the mean of all points carrying that label. However, since this information is hidden, maximum likelihood estimation becomes a difficult nonlinear optimization problem. For such problems, the EM algorithm is an elegant and efficient algorithm involving latent variables. The realization of Z_i is written as an indicator vector z_i = (z_{i1}, ..., z_{iG})^T, where

$$z_{ig} = \begin{cases} 1 & \text{if } (x_i, y_i) \text{ comes from cluster } g, \\ 0 & \text{otherwise.} \end{cases}$$

The training set is written as Ω_C = {(x_1, y_1, z_1), ..., (x_N, y_N, z_N)} and the complete log-likelihood L_c as

$$L_c(\Theta) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \log\!\left[\pi_g\, p(y_i \mid x_i, \beta_g, \sigma_g)\, p(x_i \mid \mu_g, \Sigma_g)\right],$$

where Θ denotes the entire parameter space of the cluster weighted model, namely the weights π_g, the means µ_g and covariance matrices Σ_g of the cluster centers, as well as the parameters β_g, σ_g of the local models.
The EM optimization is initialized with estimates Θ^{(0)} of these parameters. One possibility, also used in the following, is to initialize the cluster weights uniformly, i.e. π_g = 1/G, the cluster means µ_g with random numbers or by picking randomly from the training data, and all covariance matrices with the identity matrix.
In the expectation step (E-step) of the algorithm, the conditional expectation of L_c is computed with respect to the current parameter estimate, giving rise to the Q-function:

$$Q(\Theta \mid \hat{\Theta}) = E\!\left[L_c(\Theta) \mid \Omega, \hat{\Theta}\right].$$

The conditional expectation affects only z_{ig}, since the terms in the logarithm depend on x_i and y_i; the E-step thus effectively reduces to calculating the expectation of z_{ig} given the observed training data. According to the definitions of Section (2.1), this posterior probability is given by

$$\tau_{ig} = E\!\left[z_{ig} \mid x_i, y_i, \hat{\Theta}\right] = \frac{\hat{\pi}_g\, p(y_i \mid x_i, \hat{\beta}_g, \hat{\sigma}_g)\, p(x_i \mid \hat{\mu}_g, \hat{\Sigma}_g)}{\sum_{j=1}^{G} \hat{\pi}_j\, p(y_i \mid x_i, \hat{\beta}_j, \hat{\sigma}_j)\, p(x_i \mid \hat{\mu}_j, \hat{\Sigma}_j)}. \qquad (21)$$

Each cluster is able to relate to each data point through this distribution: looking at Equation (21), one can see that the posterior is the ratio of one cluster's contribution to that of all the clusters. Given the expectation values, the Q-function becomes

$$Q(\Theta \mid \hat{\Theta}) = \sum_{i=1}^{N} \sum_{g=1}^{G} \tau_{ig} \log\!\left[\pi_g\, p(y_i \mid x_i, \beta_g, \sigma_g)\, p(x_i \mid \mu_g, \Sigma_g)\right].$$

In the maximization step (M-step), the next parameter estimate is obtained by globally maximizing the Q-function with respect to Θ over the parameter space. The derivative with respect to each desired parameter is calculated by taking the gradient and setting it to zero, thus obtaining a new set of parameters Θ as a function of the old parameters Θ̂; this procedure is repeated until convergence. Applying the logarithm laws, the Q-function decomposes into summands involving, respectively, the weights, the input-density parameters, and the local-model parameters. This decomposition is useful because taking the gradient with respect to a parameter of interest becomes convenient: for example, the cluster weights π_g can be computed independently of the others, while the summands without the parameter of interest vanish automatically. Since the weights obey the constraints Σ_g π_g = 1 and 0 ≤ π_g ≤ 1, a Lagrange multiplier is introduced, which leads to

$$\hat{\pi}_g = \frac{1}{N} \sum_{i=1}^{N} \tau_{ig},$$

which can equally be interpreted as Σ_i z_{ig}/N with the unknown labels substituted by their expectation values.

Figure 1: Models used in CWM clustering: examples of contours of the bivariate normal component densities for the 14 parameterizations of the covariance matrix. Source: [Bouveyron et al., 2019]
The updated estimates for the means and covariances of the clusters (µ_g, Σ_g) are likewise derived by maximizing the Q-function. The updated means and covariances are given by

$$\hat{\mu}_g = \frac{\sum_{i=1}^{N} \tau_{ig}\, x_i}{\sum_{i=1}^{N} \tau_{ig}}, \qquad \hat{\Sigma}_g = \frac{\sum_{i=1}^{N} \tau_{ig}\, (x_i - \hat{\mu}_g)(x_i - \hat{\mu}_g)^T}{\sum_{i=1}^{N} \tau_{ig}}.$$
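The E- and M-steps described above can be sketched end to end in numpy. This is a from-scratch illustration under simplifying assumptions (full covariances, linear local models, random initialization from the data), not the FlexCWM routine; by EM theory, the observed-data log-likelihood should never decrease across iterations.

```python
import numpy as np

def mvn_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    quad = np.sum(diff * sol, axis=1)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_cwm(X, y, G, n_iter=30, seed=0):
    """EM for a linear Gaussian CWM; returns parameters and log-likelihood trace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(G, 1.0 / G)                            # uniform initial weights
    mus = X[rng.choice(n, size=G, replace=False)].astype(float)
    Sigmas = np.stack([np.eye(d)] * G)                   # identity initial covariances
    betas = np.zeros((G, d + 1))
    sig2 = np.ones(G)
    Xa = np.column_stack([np.ones(n), X])                # design matrix with intercept
    loglik = []
    for _ in range(n_iter):
        # E-step: responsibilities tau_ig proportional to pi_g p(y|x,g) p(x|g)
        comp = np.empty((n, G))
        for g in range(G):
            resid = y - Xa @ betas[g]
            p_y = np.exp(-0.5 * resid ** 2 / sig2[g]) / np.sqrt(2 * np.pi * sig2[g])
            comp[:, g] = pis[g] * p_y * mvn_pdf(X, mus[g], Sigmas[g])
        loglik.append(np.log(comp.sum(axis=1)).sum())
        tau = comp / comp.sum(axis=1, keepdims=True)
        # M-step: closed-form weighted updates
        Ng = tau.sum(axis=0)
        pis = Ng / n
        for g in range(G):
            w = tau[:, g]
            mus[g] = (w @ X) / Ng[g]
            diff = X - mus[g]
            Sigmas[g] = (diff * w[:, None]).T @ diff / Ng[g]
            betas[g] = np.linalg.solve(Xa.T @ (w[:, None] * Xa), Xa.T @ (w * y))
            sig2[g] = (w @ (y - Xa @ betas[g]) ** 2) / Ng[g]
    return pis, mus, Sigmas, betas, sig2, loglik
```

On well-separated synthetic data with two local regression regimes, the log-likelihood trace increases monotonically, which is a useful correctness check for any EM implementation.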

Geometrically Constrained CWMs
The full multivariate Gaussian CWM discussed above poses several problems in the estimation process. Some of these, arising in high-dimensional spaces (large d), relate to matrix inversion failing due to singularity and to degeneracies of the algorithm. For a full covariance matrix, the parameters to be estimated number (G − 1) + Gd + G[d(d + 1)/2], which is quite large. For example, in the Epileptic Seizure data, with d = 178 and G = 5, this gives 80,549 parameters to be estimated, which is too large for any clustering model. Such a large number of parameters can lead to difficulties in estimation, including lack of precision, and can even cause the algorithm to degenerate; it also reduces the computational speed of the algorithm. To mitigate this problem, [Banfield and Raftery, 1993] and [Celeux and Govaert, 1995] introduced the eigenvalue decomposition of the cluster covariance matrix Σ_g, in the form

$$\Sigma_g = \lambda_g D_g A_g D_g^T. \qquad (26)$$

In Equation (26), D_g is the matrix of eigenvectors of Σ_g, A_g = diag{A_{1,g}, ..., A_{d,g}} is a diagonal matrix whose elements are proportional to the eigenvalues of Σ_g arranged in descending order, and λ_g is the associated constant of proportionality. Each element of this decomposition corresponds to a particular geometric property of the gth component. The matrix of eigenvectors D_g determines its orientation in R^d, while the diagonal matrix of scaled eigenvalues A_g governs its shape; the region where the gth component is most densely concentrated is determined by the dominant entries of A_g. For example, if all the values A_{j,g} are approximately equal, then the gth component is roughly spherical. The constant of proportionality λ_g determines the volume, which is proportional to λ_g^d |A_g|, where |A_g| is the determinant of A_g, preferably constrained to be equal to 1.
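The parameter counts implied by the λ_g D_g A_g D_g^T decomposition are easy to tabulate: one volume parameter per λ (shared or per-cluster), d − 1 shape parameters per A, and d(d − 1)/2 orientation parameters per D. The helper below is our own illustrative code, not part of any package, and reproduces the counts quoted in the text.

```python
def cov_params(model, d, G):
    """Free covariance parameters for a three-letter model ID (volume, shape, orientation).

    'E' = shared across clusters, 'V' = varies by cluster, 'I' = identity (no parameters).
    """
    vol, shape, orient = model
    n = G if vol == 'V' else 1                    # volume lambda_g: one per lambda
    if shape == 'V':
        n += G * (d - 1)                          # a shape matrix A_g per cluster
    elif shape == 'E':
        n += d - 1                                # one shared shape matrix A
    if orient == 'V':
        n += G * d * (d - 1) // 2                 # an orientation matrix D_g per cluster
    elif orient == 'E':
        n += d * (d - 1) // 2                     # one shared orientation matrix D
    return n
```

For the VVV model with d = 178 and G = 5 this gives 5(1 + 177 + 15753) = 79,655 covariance parameters, and only 30 after reducing to d = 3, while the most constrained model EII needs a single parameter.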
Parsimony arises in different ways from this decomposition, by constraining any or all of the volume, shape, or orientation to be equal or to vary across the clusters. The covariance matrix can also be forced to be spherical, i.e. proportional to the identity matrix I. Altogether there are two univariate models and 14 possible models in the multivariate case. Figure (1) shows examples of contours of the component densities for the various models in the two-dimensional case with two mixture components. These constrained models can have far fewer parameters to estimate independently than the full covariance model, while fitting the sample data almost as well. The constrained models can yield more precise estimates of the model parameters, accurate out-of-sample predictions, and easily interpretable parameter estimates. In addition, the model has Gd parameters for the component means µ_g and (G − 1) parameters for the mixing proportions π_g. Table (1) shows the multivariate models denoted by a three-letter identifier, where "E" stands for equal and "V" for variable. If the first letter is "E", the volume is equal/constant across the clusters, and "V" if it varies. In the same vein, the second letter is "E" for equal shape, so that A_g ≡ A for all g = 1, ..., G, "V" if the shapes vary, and "I" for spherical shape, with A_g = I for g = 1, ..., G. Finally, if "E" occupies the third position, the matrices D_g of eigenvectors specifying the cluster orientations are equal, D_g ≡ D for g = 1, ..., G; "V" means they are unconstrained, and "I" means the orientations are axis-aligned, with D_g = I for g = 1, ..., G.
Table (2) shows the number of parameters needed to specify the covariance matrix for each model in the 178-dimensional five-component case (d = 178, G = 5) and in the three-dimensional five-component case (d = 3, G = 5), obtained from the Epileptic Seizure Recognition data before and after dimensionality reduction, respectively. Before performing the dimensionality reduction, we note that CWM is impracticable. These results follow from noting that, for one mixture component, the volume is specified by 1 parameter, the shape by (d − 1) parameters, and the orientation by d(d − 1)/2 parameters. The potential gain from combining parsimony and dimensionality reduction is far greater than the gain achieved by parsimony alone relative to the full covariance parameterization. In the most extreme case in Table (2), the 178-dimensional five-component VVV model requires 79,655 parameters to represent the covariance matrices, whereas the same VVV model requires only 30 parameters when dimensionality reduction is combined with the eigenvalue decomposition. Although parsimony brings some gains, it has been observed that the most parsimonious models do not always fit the data adequately. Moreover, the number of parameters to be estimated in a parsimonious model can still be extremely high, and a preferable solution would be to apply further parameter-reduction steps to the results of the parsimonious model; however, this might not be achievable when computational time is a priority.
Alternatively, the best solution is to perform dimensionality reduction before applying parsimony. The eigenvalue decomposition method achieves what we may call a "local parameter reduction" while the "global feature space" remains huge. Fitting the original high-dimensional data, irrespective of parsimony, encumbers the CWM model: it reduces the classification power, slows the computation, and can lead to misinterpretation of the results. This is the central challenge for CWM models.

Dimensionality reduction
Given the high-dimensional nature of the data sets considered here, a preprocessing step of feature extraction is of great importance to reduce the computational burden and time complexity before fitting the CWM model. The preprocessing step proceeds as follows: first, we fit the feature set with tSNE, and afterwards we project the units nonlinearly to obtain the low-dimensional subspace. Without this dimensionality reduction, CWMs can be severely limited by high-dimensional data, which slows down the computation and hampers the clustering performance of the model. The subspace of the original features is then fed into the CWMs for clustering analysis. Moreover, tSNE makes the visualization of the high-dimensional data possible. Here, we consider both low-dimensional data and high-dimensional data with features running to the order of hundreds.
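In Python, the same preprocessing pipeline can be sketched with scikit-learn; here a `GaussianMixture` stands in for the FlexCWM fit used in this work, and the synthetic data set, perplexity, and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a high-dimensional data set: 150 units, 50 features.
X, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)

# Step 1: nonlinear projection of the feature set to a 2-D subspace via tSNE.
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Step 2: feed the low-dimensional subspace to the mixture model for clustering.
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(Z)
```

The embedding `Z` is what the chapter filters into the CWM; in the actual analyses this second step is performed by the FlexCWM package in R rather than by a plain Gaussian mixture.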

Results and Discussion
This section illustrates some real-data applications of the linear CWMs defined above with substantive high dimensionality. The analysis is performed using the R package for CWMs called FlexCWM [Mazza et al., 2018].

Abalone data
The first application concerns the prediction of the age of abalone from physical measurements. The data were taken from the UCI database, with the original sources being the Marine Resources Division and [Sam, 1995]; [Warwick et al., 1994]. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. The analysis presented below uses all the variables in the data set. The attributes of the data are: Sex: male (M), female (F), and infant (I); Length: longest shell measurement; Diameter: perpendicular to length; Height: with meat in shell; Whole.Weight: whole abalone; Shucked.Weight: weight of meat; Viscera.Weight: gut weight after bleeding; Shell.Weight: shell weight after being dried; Rings: age in years of the abalone.
There are G = 3 groups of abalone with respect to the Sex variable: M = 1528, F = 1307, and I = 1342. First, we use all the variables and check the effect on the clustering power of the linear CWM, comparing the Bayesian information criterion (BIC) produced by the fourteen parsimonious models. Figure (2) concerns the observed labeled data; this graphical representation visualizes the descriptive summary of the abalone data. The observations are color-coded according to the Rings variable grouped into a 3-class category: 1−8, 9−10, and ≥ 11. The goal is to classify the abalone according to their age group. A nonlinear projection from the original feature space to the low-dimensional feature space is performed. However, the goal here is not to separate the observations into their respective classes but to reduce the dimension of the data, which removes any multicollinearity among the features. Afterwards, we feed the projected features into the CWMs, which is always better in terms of speed and accuracy. We also performed the analysis on the original data, and the parsimonious models selected the same number of components as for the projected data. This assures us that the low-dimensional data are a good representation of the original data. However, all eight information criteria take extremely high values on the original data, which may be due to redundancy in its features.
The data can be seen as nested clusters, or as having both global and local components: clusters through the sex variable of the abalone, namely male (M), infant (I), and female (F), and groupings through the age of the abalone. This makes the data extremely difficult to separate. The previous work by [Sam, 1995] also confirmed the presence of overlap in the data and suggested that additional information would be needed to separate the classes completely using affine combinations. We note that it is easier to separate the data with respect to the sex variable, while the age groups remain cluttered together; this can hinder the performance of the clustering algorithm. Figure (3) shows the values of BIC for the models in the CWM-tSNE with G ranging over 1, ..., 5. In the CWM-tSNE model, the four models that provide the largest BIC values were VEI, VVI, VEE, and VVV, with values 176940.8, 177254.5, 176932.7, and 177462.9. In Table (3), we present only the BIC values for the 14 models considered, because all eight information criteria agreed in selecting the same number of components; the best models are distinguished in boldface. The ARI and its variants are presented in Table (4): the ARI for the models selected by the BIC, as shown in Table (3), is 1. In contrast, according to [Sam, 1995], the cascade-correlation network achieved 24.8% accuracy with no hidden nodes and 26.2% with 5 hidden nodes, while C4.5 achieved 21.5%, linear discriminant analysis (LDA) 0.0%, and the k = 5 nearest neighbor classifier 3.57%.

Protein data
The goal of the second application is to cluster the localization sites of proteins. The protein data were created by [Paul and Kenta, 1996]. Following the framework of CWMs, we transformed the multi-class response, the localization site, by adding 0.5 and taking the logarithm of the result; this converts the categorical variable into a continuous one. We pretended that the true clustering is not known a priori and checked which of the fourteen parsimonious models performs best. To visualize the BIC values, Figure (6) shows the BIC plot for the protein data, using the R commands provided by the FlexCWM package. Values are shown for up to G_max = 8 components and for the 14 covariance models estimated in the same package, i.e. for 8 × 14 competing models in all. BIC selects the model with six mixture components for the EEE and 10 other covariance specifications, in which all the covariance matrices are either equal or varied; however, BIC selects models such as VII, VVI, VEE, and VEV with eight mixture components. Table (6) is generated by comparing the clusters produced by the BIC-selected models with the true classes of the localization site using the varieties of the ARI; the VVE model shows the highest ARI values among all the models. According to the component selection produced by the CWM-tSNE model, Figure (6) shows the classification for the selected VVI model with respect to the number of clusters produced by the CWM-tSNE. We note that AWE gives a wrong number of clusters throughout the analysis. The protein data have been analyzed by [Paul and Kenta, 1996]; in their work, "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins", their model achieved 81% classification accuracy, and similar accuracy has been achieved by binary decision tree and Bayesian classification methods.
Table 5: Comparison of the BIC produced by the fourteen parsimonious models after performing the dimensionality reduction.

Table 6: Adjusted Rand Index and its variants of the fourteen parsimonious models for selecting the hidden structure (clusters) in the protein data.

Epileptic Seizure Recognition
We now analyze the Epileptic Seizure Recognition data obtained from UCI. The original dataset consists of 5 folders, each with 100 files, each file representing a single subject. Each file is a recording of brain activity for 23.6 seconds, and the corresponding time series is sampled into 4097 data points, each being the value of the EEG recording at a different point in time. There are thus 500 individuals, each with 4097 data points over 23.6 seconds. Each set of 4097 data points is divided and shuffled into 23 chunks, each chunk containing 178 data points for 1 second. In total there are 11500 rows, each with 178 data points for 1 second (columns), and the last column represents the class y ∈ {1, 2, 3, 4, 5}.
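The chunking arithmetic can be reproduced directly. Note that 23 × 178 = 4094, so a few trailing samples of each 4097-point recording must be dropped; exactly how the published preprocessing handled the remainder is our assumption here, and the signal below is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
recording = rng.normal(size=4097)            # placeholder EEG signal for one subject

# 23 one-second chunks of 178 samples each; 23 * 178 = 4094, so the last
# 3 samples are discarded (an assumption about the remainder handling).
chunks = recording[:23 * 178].reshape(23, 178)
```

Across 500 subjects this yields 500 × 23 = 11,500 rows of 178 features, matching the dimensions quoted above.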
The epileptic data thus contain a 178-dimensional input vector. The dependent variable y is defined as follows: 5 means the eyes were open when the EEG signal of the brain was recorded; 4 means the eyes were closed; 3 means the region of the tumor in the brain was identified and the EEG activity was recorded from the healthy brain area; 2 means the EEG was recorded from the area where the tumor was located; and 1 means the recording of seizure activity. The goal is to detect the underlying components of the data. In previous works, the data have been treated as a binary classification, where class 1 represents the presence of seizure in a patient and classes 2, 3, 4, 5 represent its absence. The class labels are distributed equally, with 2300 observations each.
The CWM employs ordinary least squares (OLS) in the maximization step of the EM algorithm, so it is inappropriate to fit a categorical dependent variable directly. An alternative approach is to take the logarithm of the label class and add some noise to make it a continuous variable. Afterwards, we performed the dimensionality reduction on the 178-dimensional independent variables. We note again that the goal of tSNE is not clustering; rather, we prioritize dimensionality reduction over clustering with tSNE. Figure (8) visualizes the high-dimensional data on a 2D plane with perplexity = 15, iterations = 1000, and theta = 0.5. According to the plot shown in Figure (8), tSNE reveals a linear pattern. We observed that when the perplexity is between 9 and 15 with theta = 0.5, tSNE gives an unsatisfactory low-dimensional representation, known as the "crowding problem". Due to the high volume of the data, tSNE also tends to be slower than on moderately high-dimensional data; in the setup of tSNE there is a trade-off between speed and accuracy. The hidden structure of the high-dimensional data is preserved in the low-dimensional space. However, the epileptic seizure data are highly overlapping, which makes clustering extremely difficult.
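The log-plus-noise surrogate for the categorical response can be sketched as follows; the jitter scale is our illustrative choice, and only needs to be small enough that the classes stay distinguishable on the log scale.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(1, 6, size=11500)      # simulated class labels y in {1, ..., 5}

# Continuous surrogate response: log-transform plus small Gaussian jitter, so the
# OLS-based M-step of the CWM can treat y as numeric.
y_cont = np.log(labels.astype(float)) + rng.normal(scale=0.005, size=labels.size)
```

With a jitter this small, rounding exp(y_cont) recovers the original labels, so no class information is lost by the transformation.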
Figure ( 9) shows a five-component structure of the CWMs plot on the low-dimensional data filtered into the CWMs model.Almost all the information criteria selected model with 5 mixture components.Although, we are able to visualize the highdimensional data but the clusters are not well separated.One limitation associated with the tSNE output in Figure ( 8) is that the information criteria tend to favor the number of the label class.This is however contrary to previous works which have performed binary classification where class 1 represent presence of Epileptic seizure in the patients against the absence of Epileptic seizure.To reduce the crowd points in  The number of components selected by BIC does not agree with one selected by ICL when using the model EEE.BIC suggested that the number of components is 3, while ICL suggested that the hidden number of components is 2. In the other models, the number of components selected by BIC agreed with ICL as they all selected 3 mixture components.patients even after 10, 000 iterations [Figure (10)].Afterwards, the output with the perplexity = 250 was filtered into the CWMs model.At this junction, we applied the 14 parsimonious models, and we observed a varying computation time due to their varying model complexities.The model selection was performed through eight different information criteria.We observed that the number of mixture component selected BIC did not agree with ICL.While the BIC selected the models with wrong number of components, ICL selected the model EEE with the correct number of hidden components.The output is provided in Figure (11).However, the overlap reduced drastically when compared to Figure (10).The data we have used in this Chapter are categorical data with class label more than two classes.All the class labels are first transformed to be continuous variables.This is necessary because the linear Gaussian CWMs models uses OLS for the maximization step and it can only handle a continuous dependent 
variable efficiently. A possible future direction is to create a self-sufficient CWM by embedding a dimensionality reduction technique into the CWMs package in R, allowing the package to handle high-dimensional data. In one of my papers, we tackled this limitation of the family of CWMs and mitigated the effect of the 'curse of dimensionality' on CWMs by developing an appropriate model for categorical data in high-dimensional space.
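The BIC/ICL disagreement discussed above can be illustrated with a small Gaussian mixture sketch. Note that scikit-learn implements neither the 14 parsimonious FlexCWM structures (it exposes only four covariance types) nor ICL, so the ICL below is computed by hand as BIC plus an entropy penalty on the soft assignments, and the data are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic 2-D embedding (e.g. a tSNE output) with three groups.
Z = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

def icl(gm, Z):
    # ICL = BIC plus an entropy penalty on the soft assignments t_ig;
    # like scikit-learn's BIC, lower is better.  The penalty grows when
    # components overlap, which is why ICL favors fewer components.
    t = gm.predict_proba(Z)
    entropy = -np.sum(t * np.log(np.clip(t, 1e-300, None)))
    return gm.bic(Z) + 2.0 * entropy

for g in range(1, 6):
    gm = GaussianMixture(n_components=g, covariance_type="full",
                         random_state=0).fit(Z)
    print(g, round(gm.bic(Z), 1), round(icl(gm, Z), 1))
```

On well-separated groups like these, BIC and ICL agree; on heavily overlapped data such as the seizure embedding, the entropy term pushes ICL toward fewer components than BIC.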

Figure 2: The visualization of the descriptive summary of the original Abalone data, colored according to the grouping of the Rings: Black (1−8), Red (9−10), and Green (≥ 11)

Figure 4: The classification plot of CWM-tSNE for G = 3 with model VVV selected by BIC.

Figure 7: The plot produced by CWMs after dimension reduction via tSNE. CWMs selected eight components, which aligns with the true class of the localization site of the proteins.
Table (5) lists the values of the BIC for the fourteen models. The values of the BIC according to Table (5) include −6183.0 (VII), −6183.0 (VVI), −6155.1 (VEE), and −6129.2 (VEV). Among all the models considered, the BIC of −6309.6 produced by EII makes it the worst model.

Figure 11: The CWM-tSNE plot for clustering the low-dimensional representation of the Seizure recognition data produced by tSNE with the EEE model.
Figure (12) and Figure (13) show the comparison between BIC and ICL on the number of mixture components; the values are provided in Table (7), where the left values are produced by BIC and the right by ICL.

Table 1 :
Parameterizations of the covariance matrix Σg through eigenvalue decomposition.
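The parameterizations in Table (1) are based on the standard eigenvalue decomposition of each component covariance matrix. As a sketch of the notation (the symbols follow the usual Celeux–Govaert convention adopted by the parsimonious models):

```latex
\Sigma_g = \lambda_g D_g A_g D_g^{\top},
```

where \lambda_g controls the volume of the g-th component, D_g (the matrix of eigenvectors) its orientation, and A_g (a diagonal matrix of normalized eigenvalues with \det A_g = 1) its shape. Constraining each factor to be Equal (E) or Variable (V) across components, or the Identity (I) for the shape and orientation factors, generates the family of parsimonious models such as EII, EEE, and VVV.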

Table 2 :
Number of parameters needed to specify the covariance matrix for the models used in CWMs and CWM-tSNE
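The counts in Table (2) can be reproduced from the eigenvalue decomposition Σg = λg Dg Ag Dgᵀ. As a rough sketch (assuming the standard E/V/I naming for the volume, shape, and orientation factors), the number of free covariance parameters for d dimensions and G components is:

```python
def cov_params(model, d, G):
    # Free covariance parameters under the lambda_g D_g A_g D_g^T
    # decomposition: volume (lambda_g), shape (A_g), orientation (D_g).
    vol, shape, orient = model                      # e.g. "VVV"
    n = 1 if vol == "E" else G                      # volume: 1 or G
    n += {"E": d - 1, "V": G * (d - 1), "I": 0}[shape]
    n += {"E": d * (d - 1) // 2,
          "V": G * d * (d - 1) // 2, "I": 0}[orient]
    return n

# Sanity checks for d = 2 dimensions and G = 3 components:
for m in ["EII", "VII", "EEE", "VVV"]:
    print(m, cov_params(m, d=2, G=3))
```

For instance, the fully unconstrained VVV model needs G·d(d+1)/2 parameters, which grows quadratically in d — one motivation for reducing d with tSNE before fitting the CWM.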

Table 3 :
The best model among the 14 models according to the BIC is VVV.

Table 4 :
Adjusted Rand Index and its variants of the three-component model for the Abalone data. According to the BIC, the models VEI, VVI, VEE, and VVV give ARI = 1.

The protein localization data set is available in the UCI database. The data consist of seven input variables and a class variable, with N = 336 observations. The attribute information is as follows: Sequence Name: accession number for the SWISS-PROT database; mcg: McGeoch's method for signal sequence recognition; gvh: von Heijne's method for signal sequence recognition; lip: von Heijne's Signal Peptidase II consensus sequence score; chg: presence of charge on N-terminus of predicted lipoproteins; aac: score of discriminant analysis of the amino acid content of the outer membrane; alm1: score of the ALOM membrane-spanning region prediction program; alm2: score of the ALOM program after excluding putative cleavable signal regions.

Figure 6: Model selection for the protein data using BIC values of the fourteen models. The BIC produced by five models selects the correct number of components.
The ARI and its variants are provided in Table (8). The model with the highest ARI values is the EVE model. However, the classification accuracy produced by the EEE model is 73%.
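The ARI values reported in Table (8) can be illustrated with a small sketch; the labels below are made up for illustration. ARI = 1 indicates perfect agreement between the cluster assignment and the true classes, as reported for VEI, VVI, VEE, and VVV on the Abalone data, while values between 0 and 1 indicate partial agreement:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical true classes and CWM-tSNE cluster assignments.
truth = [1, 1, 1, 2, 2, 2, 3, 3, 3]
pred  = [1, 1, 2, 2, 2, 2, 3, 3, 3]   # one observation misassigned

print(adjusted_rand_score(truth, truth))            # perfect agreement -> 1.0
print(round(adjusted_rand_score(truth, pred), 3))   # partial agreement, < 1
```

Unlike raw classification accuracy, the ARI is invariant to relabeling of the clusters, which is why it is the preferred agreement measure for mixture-model output.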