Nonlinear Covariance Matrices
The main (and in some cases the only) point of the optimization methods is to answer the question: for what parameters is this function maximized (minimized)? However, most of the interesting optimizations depend upon sample data. Assuming that there is some randomness in the observed data, the point estimates will vary depending upon the sample. This means that providing the point estimates alone will not fully describe the available information—it is also desirable to report some measure of the sensitivity of the estimates to the sample.
The only optimization techniques available in RATS that can provide estimates of the standard errors and covariance matrix are the hill-climbing methods. The derivative-free methods can produce point estimates only.
The Basic Framework
With most of the non-linear estimation instructions in RATS, the function being optimized takes the form
\begin{equation} F\left( \beta \right) = \sum\limits_{t = 1}^T {f\left( {y_t ,\,{\bf{X}}_t ,\,\beta } \right)} \end{equation}
The exceptions to this are FIND, where you provide \(F\left( \beta \right)\) directly, and CVMODEL, where you provide sufficient statistics for such a sum.
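For instance, a log likelihood with this additive form can be set up for MAXIMIZE with a FRML giving the term \(f\left( {y_t ,\,{\bf{X}}_t ,\,\beta } \right)\) for a single entry; RATS sums it over the sample. The sketch below is purely illustrative (a Poisson regression, with hypothetical series Y and X, and additive constants that don't involve the parameters omitted):

*  log likelihood term for entry t (the constant -log(y!) is omitted)
nonlin b0 b1
frml poislogl = y*(b0+b1*x) - exp(b0+b1*x)
compute b0 = 0.0, b1 = 0.0
maximize(method=bfgs) poislogl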
If \(\beta_T\) is the estimate using data through \(T\) and \(\beta_0\) is the true parameter, then a first-order Taylor expansion of the gradient is
\begin{equation} \frac{{\partial F}}{{\partial \beta }}\left( {\beta _T } \right) \approx \frac{{\partial F}}{{\partial \beta }}\left( {\beta _0 } \right) + \frac{{\partial ^2 F}}{{\partial \beta \,\partial \beta '}}\left( {\beta _T - \beta _0 } \right) \end{equation}
The left-hand side of this is set to zero by the optimization, so we can solve for \(\beta _T - \beta _0\) and scale by \(\sqrt T\) to get
\begin{equation} \sqrt T \left( {\beta _T - \beta _0 } \right) \approx - \left( {\frac{1}{T}\frac{{\partial ^2 F}}{{\partial \beta \,\partial \beta '}}} \right)^{ - 1} \frac{1}{{\sqrt T }}\frac{{\partial F}}{{\partial \beta }}\left( {\beta _0 } \right) \end{equation}
Using arguments similar to those for asymptotics in linear models, the second factor on the right can be written
\begin{equation} \frac{1}{{\sqrt T }}\sum\limits_{t = 1}^T {\frac{{\partial f}}{{\partial \beta }}\left( {y_t ,\,{\bf{X}}_t ,\,\beta } \right)} \end{equation}
where the summands have expected value zero. Again, under the proper conditions (see, for instance, Hayashi, 2000, section 7.3) this will have an asymptotic Normal distribution with mean zero and a covariance matrix that can be estimated consistently by
\begin{equation} {\bf{B}}_T = \frac{1}{T}\left\{ {\sum\limits_{k = - L}^L {\sum\limits_t {\frac{{\partial f}}{{\partial \beta }}\left( {y_t ,\,{\bf{X}}_t ,\,\beta } \right)^\prime \frac{{\partial f}}{{\partial \beta }}\left( {y_{t - k} ,\,{\bf{X}}_{t - k} ,\,\beta } \right)} } } \right\} \label{eq:nonlin_bigFCovMat} \end{equation}
Combined with
\begin{equation} {\bf{A}}_T = \frac{1}{T}\sum\limits_t {\frac{{\partial ^2 f}}{{\partial \beta \,\partial \beta '}}\left( {y_t ,\,{\bf{X}}_t ,\,\beta } \right)} \label{eq:nonlin_bigFAmatrix} \end{equation}
we get the familiar “sandwich” expression
\begin{equation} \sqrt T \left( {\beta _T - \beta _0 } \right)\sim N\left( {0,\,{\bf{A}}_T^{ - 1} {\bf{B}}_T \left( {{\bf{A}}_T^{ - 1} } \right)^\prime } \right) \label{eq:nonlin_bigFSandwich} \end{equation}
\(\bf{B}_T\) can be calculated from the derivatives computed during each iteration. It’s \(\bf{A}_T\) that we don’t necessarily have. The only hill-climbing method that computes it directly is Newton–Raphson, which is used only by DDV and LDV. However, BFGS produces an approximation to the (inverse) Hessian. (In very limited situations, it produces an exact value.) If a function is only locally quadratic (which is a required assumption for obtaining the covariance matrix with this type of analysis), then what BFGS produces is an approximation. But be careful: if you start BFGS very close to the final estimates, it won’t have a chance to “map out” the curvature, since the gradient will already be quite close to zero. To give BFGS a decent chance to develop the curvature estimates, you may need to start with initial conditions pulled away from the optimum.
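A sketch of that advice, continuing the hypothetical Poisson illustration from above: estimate once to locate the optimum, then restart BFGS from values pulled away from it, so that the final run takes enough informative steps to build a usable approximation to the curvature.

maximize(method=bfgs,noprint) poislogl
compute b0 = b0-0.5, b1 = b1+0.5
maximize(method=bfgs) poislogl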
Simplifying Assumptions
In many cases, the calculation of the covariance matrix can be simplified. In some (for instance, with CVMODEL or FIND) it has to be, as the information to compute \(\bf{B}_T\) isn’t available. All of these start by assuming the \(L\) in \eqref{eq:nonlin_bigFCovMat} is zero. A non-zero \(L\) means that the partial derivatives are correlated across time. If the observations are independent, or the model is assumed to exhaust any correlation, this should be a safe assumption.
If \(F\left( \beta \right)\) is the log-likelihood function, then the information equality applies (see, for instance, Hamilton, 1994, section 14.4). \(\bf{A}_T\) and \(\bf{B}_T\) have the same expected values; they aren’t, however, equal. Whichever one we can compute more easily can substitute for the other. In the case of METHOD=BHHH, we are computing \(\bf{B}_T\) (with \(L=0\)) at every iteration to get the \(\bf{G}\) matrix in the hill-climbing procedure. For METHOD=BFGS, it’s \(\bf{A}_T\) (more accurately, its inverse) that we’re computing (approximately).
Cautions
Different methods of estimation will generally produce point estimates which agree to the precision that you request. The covariance matrices, however, can differ substantially. It’s not uncommon for standard errors to disagree at the second or third significant digit. This is because they are all based upon the asymptotic equivalence of matrices that are not identical in sample.
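As an illustration, re-estimating the hypothetical Poisson sketch from above with the two hill-climbing methods will generally reproduce the coefficients, but the standard errors will typically differ somewhat, since BHHH bases the covariance matrix on \(\bf{B}_T\) while BFGS uses its accumulated approximation to \(\bf{A}_T^{-1}\):

maximize(method=bhhh) poislogl
maximize(method=bfgs) poislogl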
By default, MAXIMIZE assumes that the information equality applies. If \(F\left( \beta \right)\) is not the log-likelihood function (apart from additive constants), the simplifying assumptions above are wrong, and the covariance matrix is likely to be quite wrong as well. For instance, if you use MAXIMIZE to fit a least squares model by maximizing \(\sum {\left( { - u_t^2 } \right)} \), the point estimates will be correct, but the standard errors will be off by a factor of \(\sqrt {2\sigma ^2 } \), since the actual log-likelihood element is \( - 0.5\left( {u_t^2 /\sigma ^2 } \right)\), not just \( - u_t^2 \).
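A minimal sketch of the correct setup (series Y and X are hypothetical): give MAXIMIZE the actual Normal log-likelihood terms, with the variance as an explicit parameter, rather than just \( - u_t^2 \):

nonlin b0 b1 sigsq
frml resid = y - b0 - b1*x
frml logl  = -0.5*log(sigsq) - 0.5*resid(t)^2/sigsq
compute b0 = 0.0, b1 = 0.0, sigsq = 1.0
maximize(method=bfgs) logl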
Because FIND can be used for optimization problems which are not based upon sample data, it doesn’t automatically compute and display standard errors. If you want them, you need to include the STDERRS option, and you must be using METHOD=BFGS. The calculation assumes that the information equality holds.
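For example, a sketch (with hypothetical series Y and X) of FIND maximizing a concentrated log likelihood over a single Box–Cox style transformation parameter, where the function value is built up by other instructions inside the FIND block:

nonlin lambda
compute lambda = 1.0
sstats / log(y)>>sumlogy
find(method=bfgs,stderrs) maximum logl
   set ytrans = (y^lambda-1)/lambda
   linreg(noprint) ytrans
   # constant x
   compute logl = %logl + (lambda-1)*sumlogy
end find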
NLLS and NLSYSTEM
The covariance matrix calculations for NLLS and NLSYSTEM are similar to those above but exploit the special structure of the problem. For instance, for NLLS,
\begin{equation} F\left( \beta \right) = \sum\limits_{t = 1}^T {\left( {y_t - f\left( {{\bf{X}}_t ,\,\beta } \right)} \right)^2 } \end{equation}
\begin{equation} \frac{{\partial F}}{{\partial \beta }} = - 2\sum\limits_{t = 1}^T {\left( {y_t - f\left( {{\bf{X}}_t ,\,\beta } \right)} \right)} \frac{{\partial f}}{{\partial \beta }}\left( {{\bf{X}}_t ,\,\beta } \right) \label{eq:nonlin_gngradient} \end{equation}
\begin{equation} \frac{{\partial ^2 F}}{{\partial \beta \,\partial \beta ^\prime }} = 2\sum\limits_{t = 1}^T {\frac{{\partial f}}{{\partial \beta }}\left( {{\bf{X}}_t ,\,\beta } \right)^\prime } \frac{{\partial f}}{{\partial \beta }}\left( {{\bf{X}}_t ,\,\beta } \right) - 2\sum\limits_{t = 1}^T {\left( {y_t - f\left( {{\bf{X}}_t ,\,\beta } \right)} \right)} \frac{{\partial ^2 f}}{{\partial \beta \,\partial \beta ^\prime }}\left( {{\bf{X}}_t ,\,\beta } \right) \label{eq:nonlin_gnfullHessian} \end{equation}
The second sum in \eqref{eq:nonlin_gnfullHessian} is a sum of the residuals \(u_t \equiv y_t - f\left( {{\bf{X}}_t ,\,\beta } \right)\) multiplied by functions of the exogenous variables only. When divided by \(T\), it isn’t unreasonable to assume that this term becomes negligible in large samples. Thus, it is ignored in computing \(\bf{A}_T\), which eliminates the need to compute second derivatives. The convenient simplification here is to assume that
\begin{equation} E\left( {\left. {u_t^2 } \right|{\bf{X}}_t } \right) = \sigma ^2 \end{equation}
which, when applied to \eqref{eq:nonlin_gngradient}, gives us
\begin{equation} E\left( {{\bf{B}}_T } \right) = \sigma ^2 E\left( {{\bf{A}}_T } \right) \end{equation}
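Dropping the second term in \eqref{eq:nonlin_gnfullHessian} and applying this assumption to the sandwich \eqref{eq:nonlin_bigFSandwich}, the \(\bf{A}_T\) and \(\bf{B}_T\) factors cancel apart from the scale factor, leaving the familiar least squares form (with \(\sigma^2\) replaced by an estimate based upon the residuals):
\begin{equation} {\rm{Var}}\left( {\beta _T } \right) \approx \hat \sigma ^2 \left( {\sum\limits_{t = 1}^T {\frac{{\partial f}}{{\partial \beta }}\left( {{\bf{X}}_t ,\,\beta } \right)^\prime \frac{{\partial f}}{{\partial \beta }}\left( {{\bf{X}}_t ,\,\beta } \right)} } \right)^{ - 1} \end{equation}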
ROBUSTERRORS, LWINDOW, LAGS and CLUSTER
The ROBUSTERRORS option is available on the instructions MAXIMIZE, NLSYSTEM, NLLS, LDV, DDV and GARCH to do the full covariance matrix calculation described by \eqref{eq:nonlin_bigFCovMat}, \eqref {eq:nonlin_bigFAmatrix} and \eqref{eq:nonlin_bigFSandwich}. This can only be used with METHOD=BFGS on MAXIMIZE and METHOD=GAUSS (the default) for NLLS and NLSYSTEM.
The LAGS option, when used together with ROBUSTERRORS, gives a non-zero value to \(L\) in \eqref{eq:nonlin_bigFCovMat}. This is subject to the same caution as for linear models: if you don’t use a choice for LWINDOW like NEWEYWEST, the covariance matrix computed could fail to be positive definite.
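A sketch of this for a nonlinear regression with serially correlated errors (the series and the functional form are hypothetical); LAGS sets \(L\) in \eqref{eq:nonlin_bigFCovMat}, and LWINDOW=NEWEYWEST keeps the estimated matrix positive semi-definite:

nonlin b0 b1
frml powerfit = b0*x^b1
compute b0 = 1.0, b1 = 1.0
nlls(frml=powerfit,robusterrors,lags=4,lwindow=neweywest) y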
ROBUSTERRORS with CLUSTER=series of category values allows for arbitrary patterns of correlation in the gradient within the categories given by the values of the series you provide. You must have at least as many categories as parameters to estimate, and generally should have many more.
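For example (reusing the hypothetical POWERFIT formula from the previous sketch, with FIRMID a series holding a category code for each observation):

nlls(frml=powerfit,robusterrors,cluster=firmid) y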
For maximum likelihood estimators, estimation with ROBUSTERRORS is sometimes referred to as QMLE (Quasi-Maximum Likelihood Estimation), since using ROBUSTERRORS is generally a concession that the true likelihood isn't known; however, there are many models for which the standard MLEs are consistent even if the likelihood isn't known.
Copyright © 2025 Thomas A. Doan