BFGS Algorithm
The BFGS algorithm (Broyden, Fletcher, Goldfarb and Shanno), described in Press et al. (2007), is the workhorse optimization algorithm for general maximum likelihood, not just in RATS, but in many other statistical packages, as it works quite well in a broad range of applications. As a hill-climbing algorithm, BFGS requires that the function being optimized be twice continuously differentiable, which will be the case for most log likelihoods that you will encounter. The function will then have a second-order Taylor expansion around \(\theta _0\):
\begin{equation} f(\theta ) \approx f({\theta _0}) + f'({\theta _0}) \bullet (\theta - {\theta _0}) + (1/2)(\theta - {\theta _0})'f''({\theta _0})(\theta - {\theta _0}) \end{equation}
If \(f\) were quadratic, then \(f''({\theta _0})\) (the Hessian) would be a constant matrix \(\bf{Q}\), and the optimum could be found directly by solving
\begin{equation} \theta = {\theta _0} - {{\bf{Q}}^{ - 1}}f'({\theta _0}) \end{equation}
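This one-step property is easy to verify numerically. The following is a minimal sketch (in Python with NumPy, used here purely for illustration; the example quadratic and its coefficients are invented for the demonstration) showing that the Newton step lands exactly on the optimum from an arbitrary start:

```python
import numpy as np

# Maximize the quadratic f(theta) = c'theta + (1/2) theta' Q theta,
# whose Hessian f'' is the constant matrix Q (negative definite).
Q = np.array([[-2.0, 0.5],
              [0.5, -1.0]])               # constant Hessian
c = np.array([1.0, -3.0])

def grad(theta):
    return c + Q @ theta                  # f'(theta) for this quadratic

theta0 = np.array([10.0, -10.0])          # arbitrary start values
# theta = theta0 - Q^{-1} f'(theta0)
theta = theta0 - np.linalg.solve(Q, grad(theta0))

print(np.allclose(grad(theta), 0.0))      # True: gradient is zero at theta
```

Because the gradient of a quadratic is linear in \(\theta\), the single step solves the first-order conditions exactly, regardless of where \(\theta_0\) is.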
no matter what start values we have for \({\theta _0}\). If we start near the optimum, then the same calculation should be at least close to finding the top even if the function is only locally quadratic. There are two main problems in practice:
1. For a non-quadratic function, what happens if we're not near the optimum?
2. It may not be easy to compute \(f''({\theta _0})\).
The gradient can usually be computed fairly accurately by numerical methods. Numerical second derivatives are much less accurate and require quite a bit of extra calculation: you can compute the gradient in \(K\) dimensions with \(K + 1\) function evaluations (a base plus a slightly perturbed value in each direction), but the second derivative requires an extra \(K(K + 1)/2\) evaluations to fill in a \(K \times K\) symmetric array.
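The \(K + 1\)-evaluation gradient can be sketched as a simple forward-difference loop. This is an illustrative Python version (the test function and step size are invented for the example; it is not the scheme RATS itself uses internally):

```python
import numpy as np

def numerical_gradient(f, theta, h=1e-6):
    """Forward-difference gradient: K + 1 function evaluations in K dimensions."""
    base = f(theta)                       # one base evaluation
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        perturbed = theta.copy()
        perturbed[i] += h                 # perturb one direction at a time
        g[i] = (f(perturbed) - base) / h
    return g

# Example: f(x, y) = -x^2 - 2y^2 has gradient (-2x, -4y)
f = lambda t: -t[0]**2 - 2.0 * t[1]**2
g = numerical_gradient(f, np.array([1.0, 1.0]))
print(np.round(g, 4))                     # approximately [-2. -4.]
```

A full numerical Hessian would need the further \(K(K+1)/2\) cross-perturbed evaluations on top of these, which is what BFGS avoids.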
The key result for the BFGS algorithm is that you can get an increasingly accurate estimate of the Hessian (\({f''}\)) without ever computing it directly by seeing how the gradient changes from one iteration to the next. The precise result is that if the function is actually quadratic and you do \(K\) iterations of BFGS with exact line searches at each stage, then at the end, you will have built \(\bf{Q}\) exactly (and thus will have also found the optimum).
In practice, the function isn't globally quadratic, and we generally don't (for efficiency) do "exact'' line searches, so the algorithm will not converge exactly in \(K\) iterations and the final \({f''}\) will only be approximate. However, with rare exceptions, if the function is, indeed, twice continuously differentiable and has a negative definite Hessian (no singularities) at the local maximum on the starting "hill", then BFGS will find its way to the top of that hill.
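The quadratic-termination result can be demonstrated directly. Below is a hedged sketch of the standard BFGS inverse-Hessian update with exact line searches (written in Python/NumPy for a *minimization* problem to keep the signs simple; maximizing a log likelihood corresponds to minimizing its negative, and the example matrix is invented). After \(K\) iterations on a \(K\)-dimensional quadratic, the built-up matrix equals the true inverse Hessian and the optimum has been found:

```python
import numpy as np

# Minimize f(x) = (1/2) x'A x - b'x, whose Hessian is the constant matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # positive definite Hessian
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

K = 2
x = np.array([5.0, -5.0])                # arbitrary start values
H = np.eye(K)                            # initial inverse-Hessian guess
for _ in range(K):
    g = grad(x)
    p = -H @ g                           # search direction
    alpha = -(g @ p) / (p @ A @ p)       # exact line search (closed form
                                         # because f is quadratic)
    s = alpha * p                        # step:            x_{k+1} - x_k
    x = x + s
    y = grad(x) - g                      # gradient change: how f'' reveals itself
    rho = 1.0 / (y @ s)
    I = np.eye(K)
    # Standard BFGS update of the inverse-Hessian approximation
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

print(np.allclose(H, np.linalg.inv(A)))  # True: H rebuilt A^{-1} in K steps
print(np.allclose(A @ x, b))             # True: the optimum has been reached
```

With inexact line searches, or a non-quadratic objective, the loop still climbs reliably, but `H` after \(K\) iterations is only an approximation, which is the practical caveat discussed above.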
RATS actually builds up an estimate of the positive definite matrix \(-\bf{Q}\), the inverse of which is the \(\bf{G}\) used in the hill-climbing algorithm. This starts with a diagonal matrix, the form of which varies from instruction to instruction, but uses what information is available on the scale of each parameter. Because \(\bf{G}\) is used in estimating the covariance matrix and standard errors of the coefficients, you need to be careful not to apply BFGS to a model which has already been estimated to a fairly high level of precision. If BFGS uses fewer iterations than the number of parameters being estimated (note that even under the best conditions, it requires a full set of \(K\) iterations to exactly match the Hessian), the Hessian estimate will be poor and, hence, the standard errors derived from it will be unreliable.
All of the instructions which allow BFGS will let you use the HESSIAN option to set an initial guess for minus the inverse Hessian, which is what will be used as the hill-climbing \(\bf{G}\). If you have a model which is nearly converged, and you use an initial \(\bf{G}\) which is close to the correct matrix as well (for instance, using %XX from a previous estimation), you can have a bit more confidence in the resulting covariance matrix. If you don't have a better initial \(\bf{G}\), move your guess values for the parameters away from the optimum to force a longer estimation sequence.
Copyright © 2025 Thomas A. Doan