Information Criteria

The table below gives the formulas used for the different information criteria. The first column of formulas is generally used with anything which generates a log likelihood (log L is the sample value of the log likelihood), while the second can be used for models estimated by least squares. The two give identical orderings of models assuming a Normal likelihood.
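
To see why the two orderings agree, here is a short derivation sketch. It assumes that \({\hat \sigma }^2\) is the maximum likelihood estimate of the variance (the sum of squared residuals divided by \(T\)), which the formulas here do not state explicitly. With the variance concentrated out, the Normal log likelihood is

\(\log L = - \frac{T}{2}\left( {\log 2\pi + \log \,{{\hat \sigma }^2} + 1} \right)\)

so that

\( - 2\log L = T\log \,{{\hat \sigma }^2} + T\left( {1 + \log 2\pi } \right)\)

The last term depends only on \(T\), so it is the same for every model estimated over the same sample and drops out of any comparison.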

\(k\) is the number of estimated parameters (or regressors) and \(T\) is the number of observations. It's also possible (and in many ways desirable) to "standardize" these by dividing the expression by \(T\), which gives the statistic a more manageable scale without changing the relative order of models. That's what the @REGCRITS procedure does.

You're looking for the model which minimizes the chosen criterion. Note that you should always use the same sample range for all models considered. If you don't, you can bias your decision in favor of one model or another based mainly upon the number of data points used. Note also that you need to be careful that the values of \(k\) are compatible across models. For instance, the IC formulas for least squares models usually don't count the variance as part of \(k\) (the variance is concentrated out in least squares), which is fine if you're comparing least squares models with each other. However, if you're comparing a least squares model with (for instance) a GARCH model, the variance needs to be included in the parameter count for the regression, since the GARCH model explicitly includes parameters to describe the variance. The RATS variable %NFREE (which is used by @REGCRITS) includes any variance (or covariances for multivariate models) that is concentrated out in the estimation, so @REGCRITS will give you comparable parameter counts if you're using different general model types.

Criterion	Log likelihood form	Least squares form
AIC (Akaike Information Criterion)	\( - 2\log L + k \times 2\)	\(T\log \,{{\hat \sigma }^2} + k \times 2\)
SBC or BIC (Schwarz Bayesian Criterion, or Bayesian Information Criterion)	\( - 2\log L + k \times \log T\)	\(T\log \,{{\hat \sigma }^2} + k \times \log T\)
HQ (Hannan-Quinn)	\( - 2\log L + k \times 2\log \left( {\log T} \right)\)	\(T\log \,{{\hat \sigma }^2} + k \times 2\log \left( {\log T} \right)\)
FPE (log) (Final Prediction Error)	\( - 2\log L + T\log \left( {\frac{{T + k}}{{T - k}}} \right)\)	\(T\log \,{{\hat \sigma }^2} + T\log \left( {\frac{{T + k}}{{T - k}}} \right)\)
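
As an illustration outside of RATS, the formulas in the table can be sketched in a few lines of Python. This is only a minimal sketch: the function names (`penalties`, `ic_from_loglik`, `ic_from_sigma2`) and the `standardize` flag are hypothetical and are not part of RATS or @REGCRITS, and the log likelihood values in the usage lines are made up purely for demonstration.

```python
import math

def penalties(k, T):
    """Penalty terms from the table, keyed by criterion name.

    Requires T > k (for FPE) and T > 1 (for HQ).
    """
    return {
        "AIC": 2.0 * k,
        "SBC": k * math.log(T),
        "HQ": 2.0 * k * math.log(math.log(T)),
        "FPE": T * math.log((T + k) / (T - k)),
    }

def ic_from_loglik(loglik, k, T, standardize=False):
    """Log likelihood form: -2 log L plus the penalty."""
    scale = 1.0 / T if standardize else 1.0
    return {name: scale * (-2.0 * loglik + pen)
            for name, pen in penalties(k, T).items()}

def ic_from_sigma2(sigma2, k, T, standardize=False):
    """Least squares form: T log(sigma^2) plus the penalty."""
    scale = 1.0 / T if standardize else 1.0
    return {name: scale * (T * math.log(sigma2) + pen)
            for name, pen in penalties(k, T).items()}

# Made-up log likelihoods for two models fit over the same T = 100 observations.
T = 100
small = ic_from_loglik(loglik=-152.3, k=3, T=T)
large = ic_from_loglik(loglik=-150.9, k=5, T=T)
for name in small:
    choice = "small" if small[name] < large[name] else "large"
    print(f"{name}: small={small[name]:.3f} large={large[name]:.3f} -> {choice}")
```

Passing `standardize=True` divides each value by \(T\), which rescales the statistics but leaves the ranking of models unchanged as long as all models use the same sample.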

All of these take minus twice the log likelihood and add a function of \(k\) and \(T\) that "penalizes" additional parameters. Except for very small values of \(T\), the penalty is smallest for AIC, larger for HQ and largest for SBC. As a result, AIC will never pick a model smaller than HQ, which will never pick a model smaller than SBC. (FPE will generally give very similar results to AIC.)
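
The ordering of the penalties can be checked numerically. The following sketch (with hypothetical helper names `penalty_per_parameter` and `fpe_penalty`) prints the penalty per added parameter for a few sample sizes. It is only meant to show where "very small values of \(T\)" stop mattering, and why FPE tracks AIC.

```python
import math

def penalty_per_parameter(T):
    """Penalty added for each extra parameter, from the formulas above."""
    return {
        "AIC": 2.0,
        "HQ": 2.0 * math.log(math.log(T)),
        "SBC": math.log(T),
    }

def fpe_penalty(k, T):
    """Total FPE penalty for k parameters; close to 2k when k/T is small."""
    return T * math.log((T + k) / (T - k))

for T in (10, 15, 16, 100, 1000):
    p = penalty_per_parameter(T)
    ordered = p["AIC"] < p["HQ"] < p["SBC"]
    print(f"T={T:5d} AIC={p['AIC']:.3f} HQ={p['HQ']:.3f} "
          f"SBC={p['SBC']:.3f} AIC<HQ<SBC: {ordered}")

print(f"FPE penalty for k=5, T=1000: {fpe_penalty(5, 1000):.4f}")
```

The HQ penalty passes the AIC penalty once \(\log (\log T) > 1\), that is, for \(T\) of 16 or more, and the SBC penalty is above the HQ penalty for any \(T > 1\).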

There are conflicting theoretical results about which of these is “better.” If the correct model is included in the collection of models examined, SBC will, given enough data, choose it, while AIC won’t do so necessarily—even in very large samples, it can pick models which are too big. (SBC is “consistent”, AIC isn’t). However, if the correct model isn’t included (for instance, the actual lag length is infinite), then AIC proves to be better at picking an approximate model. (AIC is “efficient”, SBC isn’t). For more information, see the discussion in Brockwell and Davis (1991). The intermediate penalty (HQ) has certain optimality properties in choosing lag length in autoregressions, in particular.

Information criteria are used in the procedures @BJAUTOFIX and @GMAUTOFIX for choosing Box-Jenkins models, and are options for lag length choices in a number of other procedures such as @DFUNIT and @VARLAGSELECT.

Applicability

These are all based upon taking some theoretically reasonable (but often practically uncomputable) criterion for choosing among models and eliminating terms which are asymptotically "negligible", leaving the log likelihood term and the penalty term. There are other assumptions, but in particular, if a likelihood ratio test between the competing models would have a non-standard distribution, the assumptions underlying the information criteria break down. Situations of this type include:

• Decisions on "unit root" behavior: unit root or not, rank of cointegration. The @BAYESTST procedure offers a (fairly complicated) calculation of a version of SBC for a unit root test.

• Decisions on a boundary value, such as variance=0 vs free.

• Decisions regarding models with unidentified parameters under some circumstances, such as switching vs non-switching.

Note that what matters is whether you are making a decision which directly involves the "non-standard" asymptotics. Picking a lag length within a model will almost always be fine, even if the model as a whole has non-standard behavior.

Comparison with Hypothesis Testing

If two competing models are nested, then the difference between the AICs (the AIC of the smaller model minus the AIC of the larger one) will be

\(2(\log L_1 - \log L_0 ) - 2 (k_1 - k_0 )\)

where 1 represents the larger model and 0 the smaller one. The first term is the likelihood ratio test statistic for a null of the smaller model against an alternative of the larger one. The second is twice the degrees of freedom of that test. (For SBC, the 2 in the second term would be replaced by \(\log T\)). We would prefer the larger model to the smaller one if the likelihood ratio statistic exceeds 2 times the degrees of freedom. If, instead of using the IC to compare the models, we did a likelihood ratio test, we would compare the likelihood ratio with the desired percentile (typically .95, that is, the .05 tail) of the \(\chi ^2\). The critical values of the chi-squared go up more like 1 times the degrees of freedom (because the mean is equal to the degrees of freedom, but the standard deviation goes up only with the square root). So the implied penalty in the likelihood ratio test is smaller than for AIC, except at small differences in the number of parameters: at the .05 level, the critical value is above twice the degrees of freedom only through 7 degrees of freedom. In a model where the number of parameters can get quite large (such as VARs), this can produce "conflicting" results between AIC and formal tests. See VARLAG.RPF for an example.
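
The comparison between the two thresholds can be sketched in Python. To keep the example free of external packages, the chi-squared percentile here is computed with the Wilson-Hilferty approximation rather than an exact quantile function. That substitution is an assumption of this sketch, and the helper name `chi2_quantile_wh` is hypothetical.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, df):
    """Approximate chi-squared quantile (Wilson-Hilferty cube-root formula)."""
    z = NormalDist().inv_cdf(p)   # standard Normal quantile
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * h ** 0.5) ** 3

# Compare the .05-level likelihood ratio threshold with the AIC threshold (2 * df).
for df in range(1, 13):
    lr_threshold = chi2_quantile_wh(0.95, df)
    aic_threshold = 2.0 * df
    stricter = "LR test" if lr_threshold > aic_threshold else "AIC"
    print(f"df={df:2d} LR threshold={lr_threshold:6.2f} "
          f"AIC threshold={aic_threshold:5.1f} stricter: {stricter}")

# First degrees of freedom at which AIC demands more than the .05-level test.
crossover = next(df for df in range(1, 100)
                 if 2.0 * df > chi2_quantile_wh(0.95, df))
print("AIC becomes the stricter rule at df =", crossover)
```

Exact quantiles (for instance from SciPy's `scipy.stats.chi2.ppf`, if that package is available) can be substituted for the approximation. For SBC, the AIC threshold `2.0 * df` would be replaced by `math.log(T) * df`.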