VAR: Forecasting, a Bayesian Approach
Problems with Unrestricted Models for Forecasting
Forecasts made using unrestricted vector autoregressions often suffer from the overparameterization of the models. The number of observations typically available is inadequate for estimating the coefficients in the VAR with precision, and this overparameterization leads to large out-of-sample forecast errors. For instance, see Fair (1979) for a comparison of the forecasting performance of several models which include an unrestricted VAR.
One possible way to handle this is to use some criterion, such as the Akaike or Schwarz criterion, to choose “optimal” lag lengths in the model. Example VARLAG.RPF shows how these can be used for choosing an overall lag length. However, there are \(N^2\) lag lengths which need to be chosen in a full model, which makes a complete search using such a criterion effectively impossible.
No method of applying direct restrictions (like PDL’s for distributed lags) seems reasonable. Vector autoregressions are, in effect, dynamic reduced forms, not structural relations, so the meaning of the values for coefficients is not obvious, and no “shape” constraint suggests itself.
A Bayesian Approach
The Bayesian approach to this problem is to specify “fuzzy” restrictions on the coefficients, rather than “hard” shape or exclusion restrictions. Shrinkage estimators such as this have long been suggested for dealing with multicollinearity and similar problems.
Litterman (1986) suggests the following: the usual approach to degrees of freedom problems is to reduce the number of regressors, which in autoregressive models means using fewer lags. Dropping a lag forces its coefficient to zero. Rather than adopting such an all-or-nothing approach, we “suggest” that coefficients on longer lags are more likely to be close to zero than those on shorter lags. However, we permit the data to override our suggestion if the evidence about a coefficient is strong. Formally, we implement this by placing Normal prior distributions with means of zero and small standard deviations on the longer lags. This allows us to estimate the coefficients using Theil's mixed estimation technique.
In a vector autoregression, we must concern ourselves not only with lags of the dependent variable, but also with lags of the other endogenous variables. Because of stability conditions, we have some pretty good information about the size of lag coefficients in a simple autoregression. However, it’s not as clear what the sizes of coefficients on other variables should be, and these depend, in part, on the relative scales of the variables involved.
Specification of a complete Normal prior on a VAR would be intractable because the covariance matrix of the prior would have dimensions \(N^2 L \times N^2 L\). Instead, we have put together a general form for the prior involving a few hyperparameters and, in effect, you estimate your model by choosing those.
You add the prior to your VAR by including the instruction SPECIFY in your system definition:
system(model=canusa)
variables usam1 usatbill canm1 cantbill canusxr
lags 1 to 13
det constant
specify(type=symmetric,tightness=.15) .5
end(system)
Standard Priors
In the discussion in this section, variable j refers to the \(j^{th}\) variable listed on the VARIABLES instruction and equation i to the equation whose dependent variable is variable i.
The standard priors have the following characteristics:
•The priors put on the deterministic variables in each equation are non-informative (flat).
•The prior distributions on the lags of the endogenous variables are independent Normal.
•The means of the prior distributions for all coefficients are zero. The only exception is the first lag of the dependent variable in each equation, which has a prior mean of one, by default.
Because of these restrictions, the only information required to construct the prior is:
•the mean of the prior distribution for the first own lag in each equation.
•the standard deviation of the prior distribution for lag \(l\) of variable \(j\) in equation \(i\) for all \(i\), \(j\) and \(l\): denoted \(S(i,j,l)\).
To simplify further the task of specifying the prior, the standard priors restrict the standard deviation function to the form:
\begin{equation} S(i,j,l) = \frac{\left\{ \gamma \, g(l) \, f(i,j) \right\} s_i}{s_j};\qquad f(i,i) = g(1) = 1.0 \label{eq:var_bvargeneralform} \end{equation}
where \(s_i\) is the standard error of a univariate autoregression on equation \(i\). We scale by the standard errors to correct for the different magnitudes of the variables in the system.
The part in braces is the tightness of the prior on coefficient \(i\), \(j\), \(l\). It is the product of three elements:
1.The overall tightness \(\gamma\), which, because of the restrictions on the \(f\) and \(g\) functions, is the standard deviation on the first own lag.
2.The tightness on lag \(l\) relative to lag 1, \(g(l)\).
3.The tightness on variable \(j\) in equation \(i\) relative to variable \(i\), \(f(i,j)\).
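As a concrete illustration (with values chosen purely for the example, and assuming the harmonic decay form \(g(l) = l^{-d}\) described under “Lag Length and Lag Decay” below), suppose \(\gamma = 0.2\), DECAY \(d = 1.0\), and a SYMMETRIC prior with \(w = 0.5\). The prior standard deviation on lag 3 of variable \(j \ne i\) in equation \(i\) is then
\begin{equation} S(i,j,3) = 0.2 \times \tfrac{1}{3} \times 0.5 \times \frac{s_i}{s_j} \approx 0.033\,\frac{s_i}{s_j} \end{equation}
so that coefficient is shrunk much more heavily toward zero than the first own lag, whose prior standard deviation is simply \(\gamma = 0.2\).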
The Prior Mean
By default, we use a mean of zero for the prior on all coefficients except the first own lag in each equation. The prior on this coefficient is given a mean of one. This centers the prior about the random walk process
\begin{equation} Y_t = Y_{t - 1} + u_t \end{equation}
This seems to be a reasonable choice for many economic time series. For example, there are theoretical reasons for various asset prices to follow random walks, and a series growing exponentially at rate \(a\) can be represented by
\begin{equation} \log Y_t = a + \log Y_{t - 1} + u_t \end{equation}
so its log follows a random walk with drift.
The only alternative which strongly suggests itself is a mean of zero on series which are likely to be close to white noise, such as a series of stock returns.
The MEAN and MVECTOR options of SPECIFY control the first own lag means. MVECTOR is more likely to be useful since it permits means to vary across equations:
specify(mvector=||1.0,0.0,1.0,1.0,1.0||,type=symmetric) .5
The MVECTOR option here puts a mean of 0.0 on the second equation and 1.0 on all the rest.
Lag Length and Lag Decay
Experience with VAR models has shown that it usually is better to include extra lags with a decaying lag prior than to truncate at an early lag and use the default prior. Unless you face severe data constraints, we would suggest using at least one year plus one period of lags (for example, 13 lags with monthly data). Longer lags without any decay are a bad idea.
To tighten up the prior with increasing lags, use TYPE=HARMONIC or TYPE=GEOMETRIC with an appropriate value for DECAY. A HARMONIC with DECAY=1.0 or DECAY=2.0 commonly works well. GEOMETRIC tends to get too tight too fast.
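For instance, here is a minimal sketch (reusing the CANUSA system from above, with option values chosen only for illustration) that tightens the prior with increasing lag:
system(model=canusa)
variables usam1 usatbill canm1 cantbill canusxr
lags 1 to 13
det constant
* DECAY=1.0 makes the prior standard deviations shrink roughly as 1/lag
specify(type=symmetric,tightness=.15,decay=1.0) .5
end(system)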
Highly seasonal data are difficult to work with using these standard priors because you typically expect relatively large coefficients to appear on the lags around the seasonal. See Dealing with Seasonality for some tips on dealing with seasonal data.
Overall Tightness
The default is the relatively “loose” value of 0.20. Practice has shown that a reasonable procedure is to set the TIGHTNESS parameter to something on the order of .1 or .2; then, if necessary, you can tighten up the prior by giving less weight to the other variables in the system. Since TIGHTNESS directly controls the important own lags, setting it too small will force the own lags too close to the prior mean.
Prior Type
The prior type determines the relative tightness function \(f(i,j)\): the tightness on variable \(j\) in equation \(i\) relative to that on variable \(i\). The two types of priors (selected using the TYPE option) are SYMMETRIC and GENERAL.
TYPE=SYMMETRIC is the simplest prior and is the default. There is only one free hyperparameter, given by the other’s weight parameter on SPECIFY; it is the relative weight (\(w\)) applied to all the off-diagonal variables in the system:
\begin{equation} f(i,j) = \begin{cases} 1.0 & \text{if } i = j \\ w & \text{otherwise} \end{cases} \end{equation}
The symmetric priors generally are adequate for small systems: those with five or fewer equations. A combination of TIGHTNESS=.20 and w=0.5 is a common choice. As you push w to zero, the system approaches a set of univariate autoregressions—coefficients on all variables other than the own lags of the dependent variable and the deterministic part are forced to zero. The following is an example of SYMMETRIC:
system(model=smallmod)
variables gnp m1 ipd tbill unemp bfi
lags 1 to 4
det constant
specify(type=symmetric,tight=.2) .5
end(system)
TYPE=GENERAL requires that you specify the entire \(f (i, j)\) function. Obviously, it is unrealistic to think of fine-tuning the prior by picking all of these independently. In fact, such a strategy simply transfers the problem with overparameterization from estimating too many equation coefficients to estimating too many hyperparameters.
Instead, you should use a GENERAL prior in situations where
•the model is too large to safely apply a SYMMETRIC prior. The combination of TIGHTNESS=.2 and w=.5 recommended above tends to be too loose overall for a system with six or more equations. However, making w much smaller will cut out too much interaction. Instead, use a GENERAL prior which puts moderate weight on variables which you see as being important and low weight on those you believe to be less important.
•the results of a SYMMETRIC show that you need to treat some equations more as univariate autoregressions than as part of the VAR. Use a GENERAL which is largely the same as the SYMMETRIC, but has small off-diagonal elements in the rows corresponding to these equations.
The \(f (i, j)\) function is input in one of two ways: by a RECTANGULAR array (using the MATRIX option) or by supplementary cards.
system(model=ymrp)
variables gnp m1 cpr ppi
lags 1 to 12
det constant
specify(type=general,tight=.2,decay=1.0)
# 1.0 0.5 0.5 0.5 $
  0.1 1.0 1.0 0.1 $
  0.1 1.0 1.0 0.1 $
  0.5 0.5 0.5 1.0
end(system)
or
declare rect priormat(4,4)
input priormat
1.0 0.5 0.5 0.5
0.1 1.0 1.0 0.1
0.1 1.0 1.0 0.1
0.5 0.5 0.5 1.0
specify(type=general,matrix=priormat,tight=.2,decay=1.0)
Estimation Methods
The VAR with a prior is estimated using the ESTIMATE instruction. This employs a variation on the mixed estimation procedure. It should be noted that, with this type of prior, single-equation techniques are not optimal except in the unlikely case that the residuals are uncorrelated. In the early 1980’s, when Bayesian VAR’s were introduced, system-wide techniques weren’t feasible for any but the smallest models. And it’s still the case that full system estimators, done properly (see, for instance, the GIBBSVAR.RPF program), can be applied only to medium-sized models, because of the size of the matrices which must be inverted.
ESTIMATE (and KALMAN) therefore use the simpler single-equation calculations. The gains from more careful estimation are likely to be small, since it is only the combination of a prior and a non-diagonal covariance matrix that produces any gain at all. Our suggestion would be that you develop the model using the basic techniques and switch to the more computationally intensive methods only once the model has been built.
Differences with END(SYSTEM)
When you use a SPECIFY in setting up the system, the END(SYSTEM) instruction causes RATS to print a synopsis of the prior. For example:
Summary of the Prior...
Tightness Parameter 0.100000
Harmonic Lag Decay with Parameter 0.000000
Standard Deviations as Fraction of Tightness and Prior Means
Listed Under the Dependent Variable
LOGCANGDP LOGCANDEFL LOGCANM1 LOGEXRATE CAN3MTHPCP LOGUSAGDP
LOGCANGDP 1.00 0.50 0.01 0.01 0.20 0.50
LOGCANDEFL 0.50 1.00 0.01 0.01 0.20 0.50
LOGCANM1 0.50 0.50 1.00 0.01 0.20 0.50
LOGEXRATE 0.50 0.50 0.01 1.00 0.20 0.50
CAN3MTHPCP 0.50 0.50 0.01 0.01 2.00 0.50
LOGUSAGDP 0.50 0.50 0.01 0.01 0.20 1.00
Mean 1.00 1.00 1.00 1.00 1.00 1.00
Variables Defined by SPECIFY
%PRIOR
SPECIFY stores the matrix of weights and means in %PRIOR, an \((N + 1) \times N\) array, arranged as printed above. By making changes to %PRIOR, you can alter a standard prior without going through a complete redefinition of the SYSTEM.
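For example, a minimal sketch (with values chosen purely for illustration, and assuming, per the description above, that a subsequent ESTIMATE picks up the modified array):
* %PRIOR is arranged as in the printed summary: row = explanatory variable,
* column = equation, with the prior means in the last row
compute %prior(2,3) = 0.1
estimate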
Differences with ESTIMATE
RATS estimates the system of equations using the mixed estimation technique. The differences between ESTIMATE for systems with a prior and for systems without a prior are:
•The degrees of freedom reported are not \(T-K\), where \(K\) is the number of regressors, but \(T-D\), where \(D\) is the number of deterministic variables. This is a somewhat artificial way to get around the problem that, with a prior, \(K\) can exceed \(T\).
•With the option CMOM=SYMMETRIC array, you can obtain the cross-product matrix \(\mathbf{X}'\mathbf{X}\) of the regressors.
•With the option DUMMY=RECTANGULAR array of dummy observations, RATS saves the dummy observations used in doing the mixed estimation procedure in the specified array.
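A minimal sketch combining the two options (the array names XXMAT and PRIORDUMMIES are hypothetical):
* save the cross-product matrix and the prior's dummy observations
estimate(cmom=xxmat,dummy=priordummies)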
Choosing a Prior
Influence of Model Size
The larger a model is relative to the number of data points, the more important the prior becomes, as the data evidence on the individual coefficients becomes weaker. “Average” situations are models with 9 parameters per equation for 40 data points, 30 for 120, and 70 for 400. These models call for moderately tight priors. Substantially larger models for a given sample size require greater tightness through either:
•a lower value for TIGHTNESS, or
•the downweighting of the “other” variables, either through a tighter SYMMETRIC prior or through use of a GENERAL prior.
Objective Function
In searching over the parameters governing the prior, we need to have, formally or informally, an objective function. Because we are generating a forecasting model, the best forms for this are based upon forecast errors. Three have been used:
•Theil U values (computed with the instruction THEIL), which can be used formally, by mapping them to a single value with a weighted average, or informally, by examining changing patterns in the values.
•the likelihood function of the data conditional on the hyperparameters. This is a by-product of the Kalman filter procedure.
•the log determinant of the covariance matrix of out-of-sample forecast errors.
The last two are discussed in Doan, Litterman and Sims (1984). The third was used in that paper and is the most difficult to compute. Our preference is the informal use of Theil U statistics.
In all cases, we calculate simulated “out-of-sample” forecasts within the data range. We do this by using the Kalman filter to estimate the model using only the data up to the starting period of each set of forecasts.
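A minimal sketch of this procedure, using the CANUSA model from earlier with dates assumed purely for illustration (the bookkeeping for comparing forecasts with actuals is omitted):
* estimate using data only through the end of 1999
estimate(noprint) * 1999:12
do time=2000:1,2005:12
   * forecast 12 steps from TIME, using coefficients based on data through TIME-1
   forecast(model=canusa,results=fcsts,steps=12,from=time)
   * update the coefficient estimates to include period TIME
   kalman
end do time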
A simple procedure which we have found to be effective is the following:
1.Run a system of univariate OLS models to get benchmark Theil U’s.
2.Run a system of univariate models (a SYMMETRIC prior with w=0) with a standard value for TIGHTNESS.
3.Run a standard SYMMETRIC prior.
Based upon these, adjust the prior (switching to a GENERAL prior), as in the sketch following this list:
•If the Theil U’s in an equation are worse in 2 than in 1, loosen up on the own lags by setting the diagonal element to 1.5 or 2.0.
•If the Theil U’s in an equation are worse in 3 than in 2, tighten up on the other variables by reducing the off-diagonal elements.
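For a three-variable system in which, say, the second equation forecast worse with the prior than without, a minimal sketch (values hypothetical; row \(i\) of the matrix corresponds to equation \(i\)) applying both adjustments to that equation:
declare rect priormat(3,3)
input priormat
 1.0 0.5 0.5
 0.1 1.5 0.1
 0.5 0.5 1.0
* equation 2: own-lag weight loosened to 1.5, off-diagonal weights cut to 0.1
specify(type=general,matrix=priormat,tight=.1)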
Selecting a Model
The selection of a Vector Autoregressive model for forecasting is more difficult than selection of an ARIMA or other univariate time series model. You have to make the following choices:
•which variables to include in the model
•lag length
•prior structure and hyperparameters
To look at a simple illustration, suppose that we are interested in forecasts of the series SALES. Previous tests have shown that SALES is closely related to national economic conditions. A first try at a model might be a two-variable system with SALES and GDP. However, the model uses a one-step forecast of GDP to compute the two-step forecast of SALES. If the initial forecast of GDP is poor, then the forecasts of SALES derived from it are also likely to be unsatisfactory.
As a second attempt, you consider the addition of variables to improve the GDP forecasts. However, a problem arises as you add more variables: it seems that there are always still more variables which you could add to improve the forecasts of the existing variables. Obviously, you cannot include in the system every variable which might have an effect on SALES (through some route). At some point, you must decide which variables to include and which to exclude.
Choosing a Set of Variables
Although you can incorporate quite a few variables in a system through an appropriate choice of prior, it is still a good idea to restrict yourself to relatively small systems (3 to 5 variables) if you want to choose a model to forecast a single series. You should think of a list of candidate series. Usually, there are some obvious choices, such as GDP in the example above.
The instruction ERRORS can be very helpful in refining the forecasting model, especially when you have no strong prior information about which series will be important for predicting the variables in question. It provides two valuable pieces of information:
The standard errors of forecast
You can use these for a quick comparison of the forecasting abilities of the VAR’s from several sets of possible variables. Check the computed standard errors for the variable(s) of greatest interest. The set that produces the lowest values should be regarded as the most promising.
The decomposition of variance
This can indicate which variables you might replace to lower the errors. A variable which explains a very low fraction of the target variable is a good candidate for replacement. If you use the decomposition in this manner, remember to consider the possible effects of changes in the ordering.
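For instance, a minimal sketch (with the step count assumed purely for illustration) using the SMALLMOD system from above:
estimate
* standard errors of forecast and decomposition of variance, 8 steps ahead
errors(model=smallmod,steps=8)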
Using Partial VAR’s
Often it is unrealistic to cast the forecasting model in the form of a full VAR. For instance, it is probably reasonable to assume that SALES (of a single company) has no real explanatory power for economy-wide variables, so a GDP equation could omit SALES entirely.
If we return to the SALES example above, a possible model would consist of
1.a single equation that explains SALES by lagged SALES, lagged GDP, and perhaps one or two other variables, and
2.a separate VAR system that forecasts GDP and the other variables.
We can combine the SALES equation with the other equations to form the forecasting model. This has several advantages over a VAR system which includes SALES:
•We can estimate the VAR subsystem using all the data available for its variables. This may be quite a bit longer than the data record for SALES.
•We can give special attention to the single equation for SALES, particularly if there is a seasonality problem (see below).
However, it is not possible to put a prior on the coefficients of the SALES equation using SPECIFY. Instead, you can use the procedure @MIXVAR to estimate this equation separately, then combine it with the VAR for the national variables. @MIXVAR, in effect, estimates a single equation using a prior of the same type used in full VAR's.
system(model=national)
variables orddurgd ipelect prime
lags 1 to 4
det constant
specify(type=symmetric) 1.0
end(system)
estimate
@mixvar(define=saleseq,numlags=2) sales
# orddurgd ipelect prime
forecast(model=national+saleseq,results=forecasts,$
steps=6,from=2007:3)
Dealing with Seasonality
Handling variables that exhibit strong seasonality can be somewhat tricky. There are several methods available, but none of them is guaranteed to work because seasonal effects can have very different forms.
•You can include seasonal dummies in the system using the DETERM instruction (see the sketch at the end of this section).
•You can model the VAR using seasonal differences (computed using DIFFERENCE), and use MODIFY and VREPLACE to rewrite the model prior to forecasting.
•You can include a long lag length, which allows the estimated coefficients to pick up the seasonal effect. However, you cannot use the DECAY factor with this method, since it will dampen the values of the important seasonal lags.
There are some types of non-diagonal priors (ones which don’t put independent distributions on the lags) which might help with this. These would mainly be implemented using “dummy observations”.
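As an illustration of the seasonal dummy approach, a minimal sketch for quarterly data (the series names are hypothetical, and the SEASONAL instruction is assumed to create the dummy series):
seasonal seasons
system(model=seasvar)
variables sales gdp
lags 1 to 5
* three quarterly dummies (leads of SEASONS) plus the constant;
* the prior on these deterministic variables is flat
det constant seasons{-2 to 0}
specify(type=symmetric,tight=.2) .5
end(system)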