Expectation Maximization Algorithm with Combinatorial Assumption


Expectation maximization (EM) is a popular and powerful mathematical method for parameter estimation when there exist both observed data and hidden data. The EM process depends on an implicit relationship between observed data and hidden data, which is specified by a mapping function in traditional EM and by a joint probability density function (PDF) in practical EM. However, the mapping function is vague and impractical, whereas the joint PDF is hard to define because of the heterogeneity between observed data and hidden data. This research aims to improve the competency of EM by making the relationship more feasible and easier to specify, which removes the vagueness. Therefore, the research proposes the assumption that observed data is a combination of hidden data, realized as an analytic function over numerical data points. In other words, observed points are supposedly calculated from hidden points via a regression model. Mathematical computations and proofs indicate the feasibility and clarity of the proposed method, which can be considered an extension of EM.


Introduction
The expectation maximization (EM) algorithm developed by Dempster, Laird, and Rubin, called DLR (Dempster, Laird, & Rubin, 1977), is an extension of the maximum likelihood estimation (MLE) method for the case where there exist both hidden data X and observed data Y, together with an implicit mapping φ: X → Y such that φ^(-1)(Y) = {X : φ(X) = Y}. Let f(X | Θ) be the probability density function (PDF) of the random variable X and let g(Y | Θ) be the PDF of the random variable Y. Note, Θ is the (vector) parameter.
The conditional PDF of X given Y, denoted k(X | Y, Θ), is specified as shown below. Given a sample Y = {Y_1, Y_2,…, Y_N} whose elements Y_i are mutually independent and identically distributed (iid), EM has many iterations, and each iteration has two steps: the expectation step, called E-step, determines the conditional expectation Q(Θ | Θ^(t)), and the maximization step (M-step) re-estimates the parameter. E-step: The expectation Q(Θ | Θ^(t)) is determined based on the current parameter Θ^(t) (Nguyen, 2020).
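For concreteness, the display formulas lost from this passage follow the standard DLR formulation (Dempster, Laird, & Rubin, 1977); they are restated here from those well-known definitions rather than from this paper's exact typesetting:

\[
g(Y \mid \Theta) = \int_{\varphi^{-1}(Y)} f(X \mid \Theta)\, dX,
\qquad
k(X \mid Y, \Theta) = \frac{f(X \mid \Theta)}{g(Y \mid \Theta)}
\]

\[
Q(\Theta \mid \Theta^{(t)}) = E\left[\log f(X \mid \Theta) \mid Y, \Theta^{(t)}\right]
= \int k(X \mid Y, \Theta^{(t)}) \log f(X \mid \Theta)\, dX,
\qquad
\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})
\]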
EM converges at some t-th iteration; at that time, Θ* = Θ^(t+1) = Θ^(t) is the optimal estimate of the EM process. The expectation Q(Θ | Θ^(t)) based on the mapping φ(X) represents the traditional EM proposed by DLR. Because it is too vague to specify the mapping φ(X), practical EM issues the joint PDF of X and Y, denoted f(X, Y | Θ), as a prerequisite condition to run EM, so that the expectation Q(Θ | Θ^(t)) becomes (Nguyen, 2020):

\[
Q(\Theta \mid \Theta^{(t)}) = \int f(X \mid Y, \Theta^{(t)}) \log f(X, Y \mid \Theta)\, dX
\]

However, if X and Y are too different in context due to data heterogeneity and they are not independent, it will be difficult to define the joint PDF f(X, Y | Θ). In general, it is not easy to specify either the mapping φ(X) in traditional EM or the joint PDF f(X, Y | Θ) in practical EM. Therefore, the purpose of this research is to extend the competency of EM by assuming that the observed data Y is a combination of X, which is called the combinatorial assumption (CA). In other words, Y can be calculated by an analytic function of X, which is more feasible than specifying the mapping φ(X) and easier than specifying the joint PDF f(X, Y). The analytic function is arbitrary, but in this research it is linear, called a regression function, for convenience and feasibility. In some cases, it is possible to convert nonlinear analytic functions into linear functions. The next section is the main one, which focuses on EM with CA. In general, this research is an extension of the EM algorithm.
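Procedurally, the iteration just described can be sketched in Python as follows; e_step and m_step are hypothetical placeholders for the model-specific computations, and the parameter Θ is represented as a NumPy vector for simplicity:

import numpy as np

def em(theta0, Y, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate the E-step and M-step until convergence,
    i.e., until Theta* = Theta^(t+1) = Theta^(t) within tolerance."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        Q = e_step(theta, Y)        # E-step: build Q(. | Theta^(t)) from current parameter
        theta_next = m_step(Q, Y)   # M-step: Theta^(t+1) = argmax of Q
        if np.linalg.norm(theta_next - theta) < tol:
            break                   # converged: estimates stopped changing
        theta = theta_next
    return theta

Both traditional EM (via the mapping φ) and practical EM (via the joint PDF) fit this loop; they differ only in how e_step constructs Q(Θ | Θ^(t)).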
There is almost no work that modifies the EM algorithm itself with the support of a regression model as a methodological extension of EM, but many researches utilize EM to learn regression models. Although the context of my methodological research is different, it is necessary to survey some other works related to both the EM algorithm and regression models. Zhang et al. (Zhang, Deng, & Su, 2016) used EM to learn a linear regression model from partially missing data. Their experimental results showed that the EM algorithm is better than other estimation methods in their case. The essence of regression model estimation is to estimate the error random variable under the supposition that the error variable distributes normally. If there is no assumption of normal error distribution, the EM algorithm may not be effective, which is a drawback of EM. Therefore, Barazandeh et al. (Barazandeh, Ghafelebashi, Razaviyayn, & Sriharsha, 2021) proposed a new algorithm based on combining the alternating direction method of multipliers (ADMM) with the EM ideology in order to overcome this drawback. Kokic (Kokic, 2002) also used the EM algorithm to learn a multivariate regression model in the case of missing data and applied the method to time series analysis. Zhang and Rockette (Zhang & Rockette, 2006) proposed an EM algorithm for finding the SML estimate and for variance estimation. Essentially, they utilized EM in the case of missing data. Because a Gaussian mixture is often learned by the EM algorithm, a mixture model of regression functions can also be learned by the EM algorithm, as Faria and Soromenho (Faria & Soromenho, 2009) did for fitting mixtures of linear regressions.

EM with Combinatorial Assumption
Given a sample Y = {Y_1, Y_2,…, Y_N} of size N in which all Y_i are mutually independent and identically distributed (iid).
Suppose X = (x_1, x_2,…, x_n)^T is an n×1 random vector which distributes normally with mean vector µ and covariance matrix Σ as follows:

\[
f(X \mid \Theta) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\!\left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right)
\]

Where the superscript "T" denotes the vector (matrix) transposition operator. The n×1 mean µ and the n×n covariance matrix Σ are:

\[
\mu = (\mu_1, \mu_2, \ldots, \mu_n)^T,
\qquad
\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12}^2 & \cdots & \sigma_{1n}^2 \\
\sigma_{21}^2 & \sigma_2^2 & \cdots & \sigma_{2n}^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1}^2 & \sigma_{n2}^2 & \cdots & \sigma_n^2
\end{pmatrix}
\]

Note, σ_ij² is the covariance of x_i and x_j whereas σ_i² is the variance of x_i. Let Y = (y_1, y_2,…, y_m)^T be the random variable representing every sample random variable Y_i = (y_i1, y_i2,…, y_im)^T. Note, X is an n×1 vector and Y is an m×1 vector. Suppose there is an assumption that Y is a combination of partial random variables (components) of X. As a generalization, let A_0 be an m×1 intercept vector and let Ã be an m×n matrix whose elements are called regressive coefficients, such that:

\[
Y = A_0 + \tilde{A}X
\]

The equation above is a regression function in which Y is called the responsor and X is called the regressor, whereas A = (A_0, Ã) is called the regressive matrix. This assumption is the combinatorial assumption (CA) aforementioned, and the method proposed here is called the CA method or CA algorithm. Suppose Y distributes normally with mean A_0 + ÃX and the m×m covariance matrix S, defined analogously to Σ, as follows:

\[
f(Y \mid X, \Theta) = (2\pi)^{-m/2} |S|^{-1/2} \exp\!\left(-\frac{1}{2}(Y - A_0 - \tilde{A}X)^T S^{-1} (Y - A_0 - \tilde{A}X)\right)
\]

so that the joint PDF is f_0(X, Y | Θ) = f(X | Θ) f(Y | X, Θ). The expression (ÃX)^T S^(-1) ÃX is approximated with µ as follows:

\[
(\tilde{A}X)^T S^{-1} \tilde{A}X \approx (\tilde{A}\mu)^T S^{-1} \tilde{A}\mu
\]

As a result, f_0(X, Y | Θ) and f(X, Y | Θ) are approximated accordingly. The approximation by removing the X-dependency from the expression (ÃX)^T S^(-1) ÃX is reasonable because the PDF f(X | Y, Θ) contains a second-order proportion with the built-in expression (X - µ)^T Σ^(-1) (X - µ), and this PDF also reflects the regression model with another built-in expression (Y - A_0)^T S^(-1) Ã(X - µ) including the parameter S. In other words, the dependency of (ÃX)^T S^(-1) ÃX on X is unnecessary. Moreover, the EM process will adjust the parameters in the best way later. In the following proofs and computations, we will see that such dependency removal also makes it easy to apply the shifted Gaussian integral.

It is required to calculate the normalizing quantity B in order to determine f(X | Y, Θ). By referring to the appendix, we can denote:

\[
B \cong E\left(f_0(X, Y \mid \Theta)\right)
\]

Obviously, B is totally determined. Thus, f(Y | Θ) is approximated by B and, as a result, the PDF f(X | Y, Θ) is approximated as:

\[
f(X \mid Y, \Theta) = \frac{f(X, Y \mid \Theta)}{f(Y \mid \Theta)}
\]

Let k(Y | Θ) be the factor that is constant with respect to X but is a function of Y with parameter Θ. Shortly, the conditional PDF f(X | Y, Θ^(t)) is specified (approximated) at the E-step of some t-th iteration by equation 2.6. Consequently, the expectation Q(Θ | Θ^(t)) at the E-step of some t-th iteration is totally determined.

At the M-step of the current t-th iteration, Q(Θ | Θ^(t)) is maximized by setting its partial derivatives with regard to Θ to zero, with the note that Q(Θ | Θ^(t)) is an analytic function. The next parameter µ^(t+1) at the M-step, which maximizes Q(Θ | Θ^(t)), is the solution of the equation ∂Q(Θ | Θ^(t))/∂µ = 0^T, with the note that 0 = (0, 0,…, 0)^T is the zero vector; this solution is equation 2.7. Setting the first-order partial derivative of Q(Θ | Θ^(t)) with regard to Σ to zero gives equation 2.8, and setting the first-order partial derivative with regard to A_0 to zero gives equation 2.9. Note, the next parameter µ^(t+1) is specified by equation 2.7.
Setting the first-order partial derivative of Q(Θ | Θ^(t)) with regard to Ã to zero, with the note that Q(Θ | Θ^(t)) is an analytic function (Saliba, 2016), gives equation 2.10. Note, µ is replaced by the current µ^(t). Equation 2.10 can be solved by the Newton-Raphson method. Setting the first-order partial derivative of Q(Θ | Θ^(t)) with regard to S to zero gives equation 2.11. Note, µ, A_0 and Ã are replaced by µ^(t), A_0^(t) and Ã^(t), respectively. In general, the CA method is an EM process with two steps as follows:

E-step: Determine the conditional PDF f(X | Y, Θ^(t)) specified by equation 2.6 based on the current parameter Θ^(t).

M-step: Calculate the next parameters Θ^(t+1) = (µ^(t+1), Σ^(t+1), A^(t+1), S^(t+1))^T based on the f(X | Y, Θ^(t)) determined in the E-step, as specified by equations 2.7, 2.8, 2.9, 2.10, and 2.11.

In practice, it is not necessary to compute the covariance matrices Σ^(t+1) and S^(t+1) because the computational cost is high and it is also really ineffective to estimate Σ and S. Note, the condition that both Σ and S are invertible, along with |Σ| ≠ 0 and |S| ≠ 0, is not easy to assert over many computational iterations. The most important parameters are µ and A, and we should fix the other parameters Σ and S with hints of predefined bias or background knowledge.
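To make the two-step skeleton concrete, the following minimal Python sketch updates only µ while holding Σ, S, A_0 and Ã fixed (the text recommends fixing Σ and S; fixing A as well is a simplifying assumption here), and it substitutes the exact linear-Gaussian posterior mean for the approximate conditional PDF of equation 2.6:

import numpy as np

def ca_em_mu(Y, A0, A, Sigma, S, max_iter=100, tol=1e-8):
    """Sketch of CA-style EM for the hidden mean mu under the model
    X ~ N(mu, Sigma), Y | X ~ N(A0 + A X, S), with A, Sigma, S held fixed.
    Y is an (N x m) observed sample; X is hidden n-dimensional data."""
    N, m = Y.shape
    n = Sigma.shape[0]
    mu = np.zeros(n)  # mu^(0): initial guess for the hidden mean
    # Gain matrix of the exact linear-Gaussian posterior (an assumption
    # standing in for the paper's approximate posterior, equation 2.6).
    K = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T + S)
    for t in range(max_iter):
        # E-step: posterior means E[X_i | Y_i, Theta^(t)] for every sample point
        EX = mu + (Y - A0 - (A @ mu)) @ K.T
        # M-step: mu^(t+1) maximizing Q is the average of the posterior means
        mu_next = EX.mean(axis=0)
        if np.linalg.norm(mu_next - mu) < tol:
            break
        mu = mu_next
    return mu

The gain matrix K implements the E-step here: it maps each observed Y_i to the posterior mean of its hidden X_i, and the M-step averages those means to obtain µ^(t+1).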

Discussions and Conclusions
Although the combinatorial assumption (CA) is a subjective assumption, it will reach high usefulness and high effectiveness in the many cases where there is a strong regressive relationship between observed data and hidden data. The regression function may not be linear, but in some cases it is easy to convert nonlinear functions into linear ones. For instance, there are some nonlinear functions as follows:

Logarithm function: y = α_0 + Σ_j α_j log(x_j). The transformation is y = α_0 + ã^T U with u_j = log(x_j).

Exponent function: y = exp(α_0 + ã^T X). The transformation is v = α_0 + ã^T X with v = log(y).

Product function: y = α_0 ∏_j x_j^(α_j). The transformation is v = β_0 + ã^T U with v = log(y), u_j = log(x_j), and β_0 = log(α_0).
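As a small illustration of the product-function case, hypothetical data generated as y = α_0 ∏_j x_j^(α_j) becomes linear after taking logarithms, so ordinary least squares recovers β_0 = log(α_0) and ã:

import numpy as np

rng = np.random.default_rng(0)
alpha0, alpha = 2.0, np.array([0.5, -1.0])  # hypothetical true coefficients
X = rng.uniform(1.0, 3.0, size=(200, 2))    # positive regressors
y = alpha0 * np.prod(X ** alpha, axis=1)    # product function y = a0 * prod_j x_j^a_j

# Linearize: log(y) = log(a0) + sum_j a_j * log(x_j)
U = np.log(X)
v = np.log(y)
design = np.column_stack([np.ones(len(v)), U])
coef, *_ = np.linalg.lstsq(design, v, rcond=None)
beta0, a_hat = coef[0], coef[1:]
print(np.exp(beta0), a_hat)  # approximately alpha0 and alpha

The same trick applies to the logarithm and exponent cases by transforming only the regressors or only the responsor, respectively.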
Recall that the dimensions of X and Y are n and m, respectively. Note, if m ≥ n, there is no information loss, which is the ideal case of the CA method. Otherwise, if m < n, the important parameters such as µ and A vary much, with large amplitude, due to the information loss of dimension reduction. This problem is called large-scale variation. Therefore, in practice there should be restrictions on µ and A specified as constraints u(µ) ≤ 0 and v(A) ≤ 0; for example, the slope of the regressive hyperplane specified by the normal vectors -Ã_i, which must vary within a predefined interval, is such a constraint. The solution is to apply the Lagrange duality method to maximizing Q(Θ | Θ^(t)), in which a Lagrange function l(Θ, κ, λ | Θ^(t)) is defined as the sum of Q(Θ | Θ^(t)) and the constraints u(µ) and v(A) weighted by multipliers. Note, Q(Θ | Θ^(t)) is maximized indirectly by maximizing the Lagrange function l(Θ, κ, λ | Θ^(t)), in which κ ≥ 0 and λ ≥ 0, called Lagrange multipliers, are concerned, as sketched below. Another simple trick to alleviate the large-scale variation of µ and A is to initialize appropriate values µ^(0) and A^(0) at the first iteration of the EM process.
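Under the usual sign convention for inequality-constrained maximization (an assumption, since the exact Lagrange function is not displayed here), the constrained M-step reads:

\[
\max_{\Theta}\ Q(\Theta \mid \Theta^{(t)})
\quad \text{subject to} \quad u(\mu) \le 0,\ v(A) \le 0
\]

\[
l(\Theta, \kappa, \lambda \mid \Theta^{(t)})
= Q(\Theta \mid \Theta^{(t)}) - \kappa\, u(\mu) - \lambda\, v(A),
\qquad \kappa \ge 0,\ \lambda \ge 0
\]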
An application of CA is to learn a Bayesian parameter. According to Bayesian statistics, the parameter Θ is considered a random variable. Given a sample D = {X_1, X_2,…, X_N} whose observations X_i are iid, the posterior probability of Θ given D, denoted P(Θ | D, ξ), is calculated based on D and the prior probability of Θ, denoted P(Θ | ξ):
\[
P(\Theta \mid D, \xi) = \frac{P(D \mid \Theta, \xi)\, P(\Theta \mid \xi)}{P(D \mid \xi)}
\]

Where ξ denotes background knowledge about Θ, which can be considered a sub-parameter, but we do not focus on learning ξ. Let X be the random variable representing all X_i. The most popular method to learn a Bayesian parameter is to use a binomial sample and the beta distribution. Here the CA method considers the parameter Θ as hidden data and the random variable X as observed data.
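For reference, the binomial-sample-with-beta-prior scheme just mentioned works by standard conjugate updating (ordinary Bayesian learning, not the CA method itself); the following minimal Python sketch uses illustrative hyperparameters a and b:

# Conjugate Bayesian update: binomial sample with a beta prior.
# Prior: Theta ~ Beta(a, b); data D: k successes in N Bernoulli trials.
# Posterior: Theta | D ~ Beta(a + k, b + N - k).
a, b = 2.0, 2.0        # illustrative prior hyperparameters
N, k = 20, 14          # illustrative sample: 14 successes in 20 trials
a_post, b_post = a + k, b + (N - k)
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)  # point estimate of Theta given D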
In general, CA is an extension of EM whose strong points are feasibility and simplicity together with the clearness of its mathematical formulas. Especially, in the ideal case that the dimension of Y is larger than or equal to the dimension of X, the CA method will produce the best estimates because there is no information loss. However, its drawback is the large-scale variation in determining the important parameters µ and A when the dimension of Y is smaller than the dimension of X. Therefore, in the future I will research the Lagrange duality method to maximize the expectation Q(Θ | Θ^(t)) with constraints on µ and A.

Appendix
This appendix focuses on calculating integrals related to the Gaussian function used in the main text, namely the shifted Gaussian integral. Suppose X distributes normally with mean µ = (µ_1, µ_2,…, µ_h)^T. The product x_j x_l is approximated by µ_j µ_l, and the quantities y_i² and z_i² are then rewritten accordingly.
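For reference, the one-dimensional shifted Gaussian integral underlying these calculations is the well-known identity:

\[
\int_{-\infty}^{+\infty} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx = \sqrt{2\pi}\,\sigma
\]

The shift µ does not change the value of the integral, which is why removing the X-dependency in the main text makes the integration tractable.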