RATS 10.1

The EM algorithm (Dempster, Laird and Rubin, 1977) is a general method for handling situations where you have a combination of observed data and unobserved or latent data, and where the likelihood would have a convenient form if the unobserved data were known. It’s an iterative algorithm, which alternates between an E (or Expectation) step and an M (or Maximization) step.

 

Let \(y\) represent the full record of the observed data and \(x\) the full record of the unobserved data. Let \(\Theta \) represent the model parameters being estimated. We’re assuming that the log likelihood as a function of both \(x\) and \(y\) is \(\log f\left( {y,x|\Theta } \right)\), which is itself fairly easy to handle. Unfortunately, since we only see \(y\), maximum likelihood needs to maximize over \(\Theta \) the log marginal density:

\begin{equation} \log f\left( {y|\Theta } \right) = \log \,\int {f\left( {y,x|\Theta } \right)} \,dx \label{eq:nonlinEMMarginal} \end{equation}

which can be quite difficult in the (non-trivial) case where \(x\) and \(y\) aren’t independent. The result in the DLR paper is that repeating the following process will (under regularity conditions) eventually produce the maximizer of \eqref{eq:nonlinEMMarginal}:
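As a concrete illustration (a standard textbook case, not necessarily the model in EMEXAMPLE.RPF), suppose each \(y_t \) is drawn independently from one of two Normal regimes, with the unobserved \(x_t \in \left\{ {1,2} \right\}\) indicating the regime and \(p\) the probability of regime 1. Then \eqref{eq:nonlinEMMarginal} is

\begin{equation*} \log f\left( {y|\Theta } \right) = \sum\limits_t {\log \left[ {p\,\phi \left( {y_t ;\mu _1 ,\sigma _1^2 } \right) + \left( {1 - p} \right)\,\phi \left( {y_t ;\mu _2 ,\sigma _2^2 } \right)} \right]} \end{equation*}

The log of a sum doesn’t simplify, so this can only be maximized numerically, while \(\log f\left( {y,x|\Theta } \right)\) itself is just a sum of Normal log densities.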

 

E Step: Let \(\Theta _0 \) be the parameter values at the end of the previous iteration. Determine the functional form for

\begin{equation} Q\left( {\Theta |\Theta _0 } \right) = \mathop E\limits_{x|y,\Theta _0 } \log f\left( {y,x|\Theta } \right) \label{eq:nonlin_EMQDef} \end{equation}

M Step: Maximize \(Q\left( {\Theta |\Theta _0 } \right)\) with respect to \(\Theta\).
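Continuing the mixture illustration, the E step works out the posterior regime probabilities \(\hat p_t = P\left( {x_t = 1|y_t ,\Theta _0 } \right)\), and \eqref{eq:nonlin_EMQDef} becomes

\begin{equation*} Q\left( {\Theta |\Theta _0 } \right) = \sum\limits_t {\left[ {\hat p_t \left( {\log p + \log \phi \left( {y_t ;\mu _1 ,\sigma _1^2 } \right)} \right) + \left( {1 - \hat p_t } \right)\left( {\log \left( {1 - p} \right) + \log \phi \left( {y_t ;\mu _2 ,\sigma _2^2 } \right)} \right)} \right]} \end{equation*}

The M step then has a closed form: \(p\) is the sample average of the \(\hat p_t \), \(\mu _1 \) and \(\sigma _1^2 \) are the \(\hat p_t \)-weighted mean and variance of \(y_t \), and similarly for regime 2 with weights \(1 - \hat p_t \).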

 

The E step is, in many cases, simpler than the calculation in \eqref{eq:nonlinEMMarginal} because it integrates the log density, which often has a much simpler functional form than the density itself. Where the augmented data set \((y,x)\) is jointly Normal, a small simplifying assumption in the E step allows you to use the even simpler

\begin{equation} \tilde Q\left( {\Theta |\Theta _0 } \right) = \log f\left( {y,E\left( {x|y,\Theta _0 } \right)|\Theta } \right) \end{equation}

that is, you just do a standard log likelihood maximization treating \(x\) as observed at its expected value given the observable \(y\) and the previous parameter values \(\Theta _0 \).
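To see where the simplification comes from (a sketch of the reasoning, supplied here for intuition): when \((y,x)\) is jointly Normal, \(\log f\left( {y,x|\Theta } \right)\) is quadratic in \(x\), so writing \(\hat x = E\left( {x|y,\Theta _0 } \right)\) and taking the expectation in \eqref{eq:nonlin_EMQDef} term by term gives

\begin{equation*} Q\left( {\Theta |\Theta _0 } \right) = \log f\left( {y,\hat x|\Theta } \right) - \frac{1}{2}{\rm{tr}}\left[ {C\left( \Theta \right)\,{\rm{Var}}\left( {x|y,\Theta _0 } \right)} \right] \end{equation*}

where \(C\left( \Theta \right)\) is (minus) the matrix of second derivatives of \(\log f\) with respect to \(x\). The simplifying assumption is to drop the trace term, which involves only the conditional variance of the unobserved data, leaving \(\tilde Q\).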

 

While EM can be very useful, it does have some drawbacks. The most important is that it tends to be very slow to reach final convergence. While \eqref{eq:nonlin_EMQDef} can often be maximized with a single matrix calculation (if, for instance, it simplifies to least squares on a particular set of data), that is only one iteration in the process. The next time through, the function changes since \(\Theta _0 \) has changed. It also (for much the same reason) gives you point estimates but not standard errors.

 

The best situation for EM is where the model is basically a large, linear model controlled (in a non-linear way) by the unobserved \(x\). Even if \eqref{eq:nonlinEMMarginal} is easily computable, maximizing it by variational methods like BFGS can be very slow due to the size of the parameter set. Because EM takes care of the bulk of the parameters by some type of least squares calculation, the iterations are much faster, so taking more iterations isn’t as costly. And in such situations, EM tends to make much quicker progress towards the optimum, again because it handles most of the parameters constructively. Where possible, the ideal is to use EM to get close to the optimum, then switch to full maximum likelihood, which has better convergence properties and gives estimates for the covariance matrix.

 

However, don’t overuse it. Some papers recommend using EM simply because the authors didn’t have access to software that could directly maximize \eqref{eq:nonlinEMMarginal}. If the original problem can be handled by maximum likelihood, try that first. Only try EM if that proves to be too slow.

 

Because of the nature of the algorithm, which has the maximization inside of an iterative process, EM usually involves a DO loop over the calculations, although you can sometimes write \eqref{eq:nonlin_EMQDef} in a form that you can input into MAXIMIZE. However, you need to insert the E step calculation between iterations. You can’t use the START option, as it is associated with each function evaluation, not each iteration.
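As a rough sketch of that DO loop structure (a two-regime Normal mixture, matching the illustration above; the series Y, the range 1 to NOBS, the starting values and the iteration count are all placeholders):

compute p=0.5, mu1=0.0, mu2=1.0, sig1=1.0, sig2=1.0
do emits=1,100
   * E step: posterior probability that each entry comes from regime 1
   set f1 1 nobs = %density((y-mu1)/sig1)/sig1
   set f2 1 nobs = %density((y-mu2)/sig2)/sig2
   set pstar 1 nobs = p*f1/(p*f1+(1-p)*f2)
   * M step: weighted means, variances and the mixing probability
   sstats 1 nobs pstar>>w1 pstar*y>>s1 pstar*y^2>>ss1 $
      (1-pstar)*y>>s2 (1-pstar)*y^2>>ss2
   compute mu1=s1/w1, mu2=s2/(nobs-w1)
   compute sig1=sqrt(ss1/w1-mu1^2), sig2=sqrt(ss2/(nobs-w1)-mu2^2)
   compute p=w1/nobs
end do emits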

 

Instead, there’s a separate (similar) option called ESTEP. This is available on DLM, MAXIMIZE, NLLS and NLSYSTEM, though it is most often used with MAXIMIZE. MAXIMIZE with ESTEP is often simpler to set up than the “classical” use of EM with specialized calculations for optimizing \eqref{eq:nonlin_EMQDef}. It will always be slower to execute than if you work out the simple maximizers in the M step (it’s basically doing the same optimization without knowing the structure of the problem), but you should only worry about execution time if execution is too slow. If you use MAXIMIZE with ESTEP, use METHOD=BHHH, not (the default) METHOD=BFGS. Because the function being optimized changes from iteration to iteration, the BFGS update steps aren’t valid.
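A rough sketch of how the same mixture problem might look with MAXIMIZE (the FRML is \eqref{eq:nonlin_EMQDef} with the E step series PSTAR treated as data; the exact form of the argument that the ESTEP option accepts is an assumption here and should be checked against the MAXIMIZE description):

nonlin p mu1 mu2 sig1 sig2
compute p=0.5, mu1=0.0, mu2=1.0, sig1=1.0, sig2=1.0
frml emlogl = pstar*(log(p)+log(%density((y-mu1)/sig1)/sig1))+$
   (1-pstar)*(log(1-p)+log(%density((y-mu2)/sig2)/sig2))
* E step: refresh PSTAR using the current parameter values
function EStep
   set f1 1 nobs = %density((y-mu1)/sig1)/sig1
   set f2 1 nobs = %density((y-mu2)/sig2)/sig2
   set pstar 1 nobs = p*f1/(p*f1+(1-p)*f2)
end
set pstar 1 nobs = 0.5
* BHHH, not BFGS, because the function changes between iterations
maximize(method=bhhh,estep=EStep()) emlogl 1 nobs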

 

A simple example of this is provided in EMEXAMPLE.RPF. EM is used extensively in Markov Switching VAR estimation, as the overall parameter sets for those models can be intractably large, but EM can handle much of the work with a multivariate linear regression.

 


Copyright © 2025 Thomas A. Doan