State Space Models: Parameter Estimation
In a state space model, the Kalman filter gives us the predictive density for \({\bf{Y}}_t \) given \(\left\{ {{\bf{Y}}_1 , \ldots ,{\bf{Y}}_{t - 1} } \right\}\) as
\begin{equation} {\bf{Y}}_t \sim N\left( {\mu _t + {\bf{C}}_t^\prime {\bf{X}}_{t|t - 1} ,{\bf{C}}_t^\prime \Sigma _{t|t - 1} {\bf{C}}_t + {\bf{N}}_t } \right) \label{eq:ssm_kfpredictivedensity} \end{equation}
We can use this with the standard time-series trick of factoring by sequential conditionals to get the full sample likelihood:
\begin{equation} p\left( {{\bf{Y}}_1 , \ldots ,{\bf{Y}}_T } \right) = p\left( {{\bf{Y}}_1 } \right)p\left( {{\bf{Y}}_2 |{\bf{Y}}_1 } \right)p\left( {{\bf{Y}}_3 |{\bf{Y}}_1 ,{\bf{Y}}_2 } \right) \ldots p\left( {{\bf{Y}}_T |{\bf{Y}}_1 ,{\bf{Y}}_2 , \ldots ,{\bf{Y}}_{T - 1} } \right) \end{equation}
The log likelihood function for the full sample is thus (omitting additive constants, although those will be included in the values reported by DLM)
\begin{equation} \begin{array}{l} - \frac{1}{2}\sum\limits_t {\log \left| {{\bf{C'}}_t \Sigma _{t|t - 1} {\bf{C}}_t + {\bf{N}}_t } \right|} \, - \\ \frac{1}{2}\sum\limits_t {\left( {{\bf{Y}}_t - \mu _t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)^\prime \left( {{\bf{C'}}_t \Sigma _{t|t - 1} {\bf{C}}_t + {\bf{N}}_t } \right)^{ - 1} \left( {{\bf{Y}}_t - \mu _t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)} \\ \end{array} \end{equation}
This is the objective function used for Kalman filtering (and smoothing). Whether you estimate parameters or not, DLM will set the variables %LOGL and %FUNCVAL to this or a similar value as discussed later.
To estimate a model, use NONLIN to set up a PARMSET with the free parameters you want to estimate. On DLM itself, you have a choice of several estimation methods. Of those, only BFGS and GAUSSNEWTON can compute standard errors for the coefficients, and GAUSSNEWTON can only be applied if you have just one observable. The derivative-free methods (SIMPLEX, GENETIC, ANNEALING and GA) are generally used only as preliminary methods (PMETHOD option) to refine initial guesses.
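For instance, here is a minimal sketch of that setup for a hypothetical local level model (the series Y, the parameter names SIGW and SIGV, and the guess values of 0.1 are placeholders, not taken from any example in this section): a few preliminary SIMPLEX iterations refine the guesses, and BFGS finishes the estimation and produces standard errors.
nonlin sigw sigv
compute sigw=0.1,sigv=0.1
* Preliminary simplex iterations to refine the guesses, then BFGS for
* the final estimates and standard errors
dlm(a=1.0,c=1.0,sw=sigw,sv=sigv,y=y,presample=diffuse,$
   pmethod=simplex,piters=10,method=bfgs) / xstates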
Coming up with good guess values for state-space models can be difficult. The most important ones to get at least within an order of magnitude or so are the variances; the others are usually more robust to poor guess values. With a complicated model, it’s often a good idea to start with a more restricted submodel and “bootstrap” the estimates of the simpler model as starting values for the complex model. For unobservable components, you might try constructing a very crude estimate of a “cycle” using simple filtering operations and deriving guess values from its behavior.
Variance Scale Factors
There are three variance terms in the model: \(\Sigma _{\bf{W}}\), \(\Sigma _{\bf{V}}\) and \(\Sigma _{0|0}\). By default, DLM calculates the likelihood assuming that these are known (or being conditioned upon). However, as with linear regressions, it’s more common that the variance in the measurement equation isn’t known. The simplest way to handle this is to assume that all three variance matrices are known up to a single (unknown) constant multiplier; this is often a reasonable assumption since, if \(\bf{C}\) is fixed, the scale of the \(\bf{X}\)’s will have to adjust to match the scale of \(\bf{Y}\) anyway.
There are two options available for estimating this multiplier: it can be concentrated out, or it can be given a prior distribution. This is controlled by the option VARIANCE. This scale factor is usually, though not always, the unknown variance in the measurement equation, hence the option name. VARIANCE=CONCENTRATED requests that this value be concentrated out, while VARIANCE=CHISQUARED gives it an (inverse) chi-squared prior. If, as is typical, this is the variance in the measurement equation, you use the option SV=1. Note well: if you choose either one of these, all the estimated variance matrices will need to be scaled by the estimated variance. The default for the VARIANCE option is VARIANCE=KNOWN, which means that all the variances are to be used as given.
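As an illustration, the local level sketch from above with the scale concentrated out might look like the following (again, the series Y and the parameter name SWRATIO are placeholders): SV is pegged at 1.0, so SW is interpreted as a ratio to the measurement error variance.
nonlin swratio
compute swratio=0.1
dlm(a=1.0,c=1.0,sv=1.0,sw=swratio,y=y,var=concentrated,$
   presample=diffuse,method=bfgs) / xstates
* The actual measurement error variance is the estimated scale factor
* (%VARIANCE, described below) times 1.0, and the state shock variance
* is %VARIANCE times swratio.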
With VARIANCE=CONCENTRATED, if we write \({\bf{N}}_t = \lambda n_t \), the log likelihood element can be rewritten as (for simplicity, omitting \(\mu_t\))
\begin{equation} - \frac{1}{2}\sum\limits_t {\log \left| {{\bf{C'}}_t \left( {\lambda \Sigma _{t|t - 1} } \right){\bf{C}}_t + \lambda n_t } \right|} \, - \frac{1}{2}\sum\limits_t {\left( {{\bf{Y}}_t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)^\prime \left( {{\bf{C'}}_t \left( {\lambda \Sigma _{t|t - 1} } \right){\bf{C}}_t + \lambda n_t } \right)^{ - 1} \left( {{\bf{Y}}_t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)} \end{equation}
where \(\lambda\) is that unknown scale factor. Using the facts that \(\left| {\lambda {\bf{M}}} \right| = \lambda ^m \left| {\bf{M}} \right|\) for an \(m \times m\) full-rank matrix \(\bf{M}\) and \(\left( {\lambda {\bf{M}}} \right)^{ - 1} = \lambda ^{ - 1} {\bf{M}}^{ - 1}\), \(\lambda\) can be isolated to give
\begin{equation} \frac{{ - rank}}{2}\log \lambda - \frac{1}{{2\lambda }}\sum\limits_t {\frac{{\left( {{\bf{Y}}_t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)^\prime \left( {{\bf{Y}}_t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)}}{{\left( {{\bf{C'}}_t \Sigma _{t|t - 1} {\bf{C}}_t + n_t } \right)}}} - {\text{ terms without }}\lambda \label{eq:ssm_kfconcentratedscale} \end{equation}
where rank is the sum over \(t\) of the ranks of the matrices \({\bf{C'}}_t \Sigma _{t|t - 1} {\bf{C}}_t + n_t \). This will just be the number of observations if you have one observable and no missing values. The maximizing value for \(\lambda\) is
\begin{equation} \hat \lambda = \frac{1}{{rank}}\sum\limits_t {\left( {{\bf{Y}}_t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)^\prime \left( {{\bf{C'}}_t \Sigma _{t|t - 1} {\bf{C}}_t + n_t } \right)^{ - 1} \left( {{\bf{Y}}_t - {\bf{C'}}_t {\bf{X}}_{t|t - 1} } \right)} \end{equation}
DLM sets the variable %VARIANCE to this value. Substituting this in gives the concentrated likelihood (again, omitting additive constants)
\begin{equation} - \frac{{rank}}{2}\log \hat \lambda - \frac{{rank}}{2} - \frac{1}{2}\sum\limits_t {\log \left| {{\bf{C'}}_t \Sigma _{t|t - 1} {\bf{C}}_t + n_t } \right|} \end{equation}
which is what DLM will maximize if you are estimating parameters. You should get the same results by estimating all the variances directly. However, the likelihood function is often very flat in the common scale factor, so the full set of variances may be close to being perfectly correlated. If this happens, you can run into problems with the optimization routine stalling out short of the maximum.
With VARIANCE=CHISQUARED, you need to provide two pieces of information in order to complete the prior: the “mean” (PSCALE=prior scale) and the degrees of freedom (PDF=prior degrees of freedom). When you Kalman filter through the data, the estimates of the variance will also be updated. If you need to track that, you can use the options SIGHISTORY=series of estimated variances and DFHISTORY=series of degrees of freedom of variance estimates. When you use VARIANCE=CHISQUARED, the conditional distribution of \(\bf{Y}\) is a \(t\), not a Normal, and the likelihood is computed using the \(t\) density. See West and Harrison (1997) if you’re interested in this approach.
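As a hedged sketch (again using a hypothetical local level model with series Y; the prior scale of 0.5, the prior degrees of freedom of 4 and the signal-to-noise ratio of 0.1 are arbitrary placeholders), Kalman filtering with a chi-squared prior on the measurement variance might look like:
* SIGT and DFT track the evolving variance estimate and its degrees of
* freedom as the filter runs through the data
dlm(a=1.0,c=1.0,sv=1.0,sw=0.1,y=y,var=chisquared,$
   pscale=0.5,pdf=4.0,sighistory=sigt,dfhistory=dft,$
   presample=diffuse) / xstates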
Estimating Variances
In many models, the only unknown parameters are the variances of the components. It seems straightforward to estimate these by maximizing the log likelihood. What complicates this is that there is nothing in the likelihood itself that constrains the individual variances to be non-negative. The only requirement for evaluating the log likelihood element at \(t\) is that the predictive covariance matrix from \eqref{eq:ssm_kfpredictivedensity} be positive-definite. That means that \(\bf{N}_t\) could be negative if the first term is positive enough and vice versa. And, unfortunately, this problem comes up all too often.
There are two main approaches for dealing with this. The first is to estimate the model unconstrained; if a variance comes in negative, zero it out and re-estimate. If that works, it should be the global maximum. If you have enough components that two variances come in negative, it’s not as clear how to proceed. In that case, it’s probably best to use RATS’ ability to put inequality constraints on parameters.
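A hedged sketch of the constrained approach, assuming parameter names SIGW and SIGV and that your version of RATS accepts inequality constraints in the NONLIN list:
* Keep both variances non-negative during estimation
nonlin sigw sigv sigw>=0.0 sigv>=0.0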
The other method is to parameterize the variances in logs. Although that doesn’t allow a true zero value, a variance which really should be zero will show up as something like -30 in logs.
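A hedged sketch of the log parameterization for the same hypothetical local level setup (names and guess values are placeholders):
nonlin lsw lsv
compute lsw=log(0.1),lsv=log(0.1)
* The variances exp(lsw) and exp(lsv) are positive by construction; a
* variance that should really be zero shows up as a very large negative
* value of the corresponding log parameter
dlm(a=1.0,c=1.0,sw=exp(lsw),sv=exp(lsv),y=y,$
   presample=diffuse,method=bfgs) / xstates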
There’s one other numerical problem that can come up when you estimate the variances directly. Even non-zero variances can sometimes be very small. It can be very hard for numerical optimizers to deal with parameters that have such naturally small scales; that is, it can be very hard to distinguish between a parameter which is naturally small and one that is small because it’s really supposed to be zero. You can sometimes fix this fairly easily by just multiplying the data (after such things as log transformation) by 100. That increases all the component variances by a factor of 10000, which is usually enough to fix this problem. Parameterizing the variances in logs will also take care of this type of problem. Concentrating out a variance can also help, since it often makes the only really small variance the one that is computed directly from a formula (as in \eqref{eq:ssm_kfconcentratedscale}) rather than searched over numerically.
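As an example of the rescaling trick mentioned above (LGDP and GDP are just illustrative names):
* Rescale the logged data by 100 so the component variances come out
* roughly 10000 times larger and are better scaled for the optimizer
set lgdp = 100.0*log(gdp)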
Examples
This estimates the MA(1) model in the form shown below. We’ll concentrate out the variance of \(\varepsilon\), so SW is 1. (SV is zero, since there is no measurement error in this form).
\begin{equation} {\bf{X}}_t = \left[ {\begin{array}{c} {\varepsilon _t } \\ {\varepsilon _{t - 1} } \end{array}} \right],\,\,{\bf{A}} = \left[ {\begin{array}{cc} 0 & 0 \\ 1 & 0 \end{array}} \right],\,\,{\bf{F}} = \left[ {\begin{array}{c} 1 \\ 0 \end{array}} \right],\,\,{\bf{w}}_t = \left[ {\varepsilon _t } \right],\,\,{\bf{C'}} = \left[ {\begin{array}{cc} 1 & \theta \end{array}} \right],\,\,{\bf{v}}_t = 0 \end{equation}
nonlin theta
* A is the transition matrix and F loads the shock into the first state
compute a=||0.0,0.0|1.0,0.0||
compute f=%unitv(2,1)
* Guess value for the MA coefficient
compute theta=0.0
dlm(a=a,f=f,c=||1.0,theta||,y=ldiff,sw=1.0,var=concentrate,$
   presample=ergodic,method=gauss) 1959:2 2008:4
The following are other examples of estimation of state-space models. In addition to these, the Commandeur-Koopman textbook (easiest) and the Durbin-Koopman and Harvey textbooks (harder) have state-space models as their main emphasis. The West-Harrison book also uses state-space models, though it is aimed more at using state-space models with fixed parameters for forecasting purposes.
The well-known Hodrick-Prescott filter is a special case of a local trend model with a fixed ratio of variances. This example looks at the behavior of the model when the ratio of variances is allowed to be estimated.
This estimates a stochastic volatility model using (approximate) state-space methods. |
This estimates a common cycle model with multiple observables and stationary data. |
This estimates a bivariate version of the Hodrick-Prescott filter (and can be extended to more than two variables). This also has multiple observables, but, unlike DLMCYCLE.RPF, has the added complication of being a non-stationary model.
A larger (14-equation) non-linear model with expectational terms, from Erceg, Henderson & Levin (2000).
Copyright © 2026 Thomas A. Doan