Statistics and Algorithms / Structural Breaks and Switching Models /

Markov Switching Models

Markov switching models for time series have become quite popular, largely as a result of the work of James Hamilton (see Chapter 22 of Hamilton, 1994). These combine a switching model (with unobservable regimes) for the description of data, with a Markov chain for the regimes.

While popular, these models are quite demanding technically. Before you decide to employ one, make sure that you understand how these work. The regimes are hidden, latent, unobservable; whatever word you want to describe them, there is and cannot be any formula which can tell you why period $t$ is (apparently) in a particular regime. You can try to interpret the results, but in many cases (particularly in models with many switching parameters) there is no clear way to describe the differences. If you think you need multiple regimes, but have a good idea about what determines whether the data are in one regime or the other, you may be better off with a “hard” threshold model such as a TAR or STAR.

We’ll start with the simpler case of the non-Markov switching model (called a mixture model) to introduce many of the concepts.

Mixture Models

Suppose that we have a series $y$ which can be in one of two possible (unobservable) regimes at each time period. (We will use the word “regime” rather than “state” to avoid possible confusion with state-space model.) We’ll use $S_t$ to represent the regime. The likelihood function when $S_t = i$ is written in the very general form $f_{(i)} \left( {y_t |X_t ,\Theta } \right)$ , where $X_t $ are exogenous and $\theta$ are parameters. Most often, the two regimes are linear regressions with different coefficients, but it’s also possible to apply this with state space or ARCH models. The unconditional probability for $S_t = 1$ is the (unknown) $p$. The log likelihood element for observation $t$ is

\begin{equation} \log \left\{ {pf_{(1)} \left( {y_t |X_t ,\Theta } \right) + \left( {1 - p} \right)f_{(2)} \left( {y_t |X_t ,\Theta } \right)} \right\} \label{eq:breaks_mixtureLogL} \end{equation}

Each likelihood element is a probability-weighted average of the likelihoods in the two regimes. This produces a sample likelihood which can show very bad behavior, such as multiple peaks. In the most common case where the two regimes have the same structure but with different parameter vectors, the regimes become interchangeable. Unlike the case with the threshold models where you design the regimes to be the one left of the break and the one right of it, the “labeling” of the two regimes isn’t defined by the model itself.

This model is not particularly useful by itself, as the regimes are independent across time. As such, it’s more often used in modelling the residual process in a more complicated model. However, the estimation methods carry over to more realistic models.

There are three main ways to estimate a model like this: conventional maximum likelihood, expectation-maximization (EM), and Bayesian Markov Chain Monte Carlo (MCMC). These have in common the need for the values for $f_{(i)} \left( {y_t |X_t ,\Theta } \right)$. Our suggestion is that you create a FUNCTION to do this calculation, which will make it easier to make changes. The return value of the FUNCTION should be a VECTOR with size equal to the number of regimes. Remember that these are the likelihoods, not the log likelihoods. If it’s more convenient doing the calculation in log form, make sure that you exp the results before you return. As an example:

function RegimeF time

type vector RegimeF

type integer time

local integer i

dim RegimeF(2)

ewise RegimeF(i)=exp(%logdensity(sigsq,%eqnrvalue(xeq,time,phi(i))))

end

Maximum Likelihood

The log likelihood function is just the sum of \eqref{eq:breaks_mixtureLogL} across observations. The parameters can be estimated by MAXIMIZE. There are two main issues: first, as mentioned above, the two regimes aren’t really defined by the model if the density functions take the same form. Generic guess values thus won’t work—you have to force a separation by giving different guess values to the two branches. Also, there’s nothing in the log likelihood function that forces $p$ to remain in the interval $[0,1]$. With two regimes, that usually doesn’t prove to be a problem, but constraining the $p$ parameter might be necessary in more complicated models. The usual way of doing that is to model it as a logistic with $p = 1 - \left( {1 + \exp (\theta )} \right)^{ - 1} $.

Maximum likelihood is always the simplest of these to set up. However, in most cases, it’s the slowest (unless you use a very large number of draws on the Bayesian method). The best idea is generally to try maximum likelihood first, and go to the extra trouble of EM only if the slow speed is a problem.

EM is able to simplify the calculations quite a bit in these models. The augmenting parameters are the regimes. Instead of \eqref{eq:breaks_mixtureLogL}, we need to maximize in the M step the simpler

\begin{equation} \mathop E\limits_{\left\{ {S_t |\Theta _0 } \right\}} \,\log f\left( {y_t ,S_t |X_t ,\Theta } \right) = \mathop E\limits_{\left\{ {S_t |\Theta _0 } \right\}} \,\left\{ {\log f\left( {y_t |S_t ,X_t ,\Theta } \right) + \log \,f\left( {S_t |X_t ,\Theta } \right)} \right\} \end{equation}

For the typical case where the underlying models are linear regressions, the two terms on the right don’t interact. Maximizing the sum of the first just requires a probability-weighted linear regression; while maximizing the second estimates $p$ as the average of the probabilities of $S_t=1$. $p$ is constructed to be in the proper range, so we don’t have to worry about that as we might with maximum likelihood. The value of the E step is that it allows us to work with sums of the more convenient log likelihoods rather than logs of the sums of the likelihoods as in \eqref{eq:breaks_mixtureLogL}.

The probabilities of the two regimes used in the M step are not the unconditional $p$ and $1-p$, but the estimated probabilities using the data and the previous parameter values. The E step requires computing those probabilities of the regimes. That can be done quite simply by Bayes rule:

\begin{equation} P\left( {S_t = 1|y_t ,X_t ,\Theta } \right) = \frac{{pf_{\left( 1 \right)} \left( {y_t |X_t ,\Theta } \right)}}{{pf_{\left( 1 \right)} \left( {y_t |X_t ,\Theta } \right) + \left( {1 - p} \right)f_{\left( 2 \right)} \left( {y_t |X_t ,\Theta } \right)}} \label{eq:breaks_EStepForP} \end{equation}

The application of EM to this problem just requires repeating those two steps. As is typical of EM, it can get estimates close to the likelihood maximizers fairly quickly. However, because it doesn’t estimate the $p$ and $\Theta$ jointly, the progress of the iterations slows down near the optimum, and it also can’t get a full covariance matrix of the estimates including both $p$ and $\Theta$. A useful strategy with larger models where maximum likelihood is too slow is to use EM for a certain number of iterations, then switch over to maximum likelihood to “polish” the estimates.

Bayesian MCMC

As with EM, the regime variables are now treated as parameters. However, now we draw values for them as part of a Gibbs sampler. The repeated steps are (in some order):

1.Draw $\Theta$ given $S_t$ and $p$

2.Draw $S_t$ given $\Theta$ and $p$

3.Draw $p$ given $S_t$ ($\Theta$ doesn't enter that)

Given specific values for $S_t$, the $\Theta$ are computed using standard regressions across the subsamples determined by the values of $S_t$. Given $\Theta$, the data information about $P\left( {S_t = 1} \right)$ is given by \eqref{eq:breaks_EStepForP}. That’s a draw from a Bernoulli which can be done with %RANBRANCH. Finally, $p$ can be drawn from a beta distribution (with %RANBETA) using the counts of the number of observations currently in each regime, combined with a (weak) prior to prevent p from collapsing to 0 or 1.

As with EM the $p$ must stay in range by construction. However, the “re-labeling” problem is a major concern here—the sampler can switch the interpretations of the regimes (sometimes many times). When the mixture model is being used for the residual process only, that is of little concern, but when it’s the actual model, it will be.

Markov Chains

As mentioned above, a model where the regimes are independent across time will generally be unrealistic for the process itself, and is generally used only for modelling residuals. The Markov Chain offers an alternative where the regime is still unobservable, but dependent on previous regimes. If we’re in regime 1, the next period we can either stay in 1, or move to 2. Let’s call the probability of staying $p_{11}$; the probability of moving thus is $1-p_{11}$. If we’re in regime 2, we could move to 1 or stay in 2. Let’s call the probability of moving $p_{12}$. The probability of staying is $1-p_{12}$. This is describing an invariant chain, with transition probabilities that are the same at all time periods. That can be relaxed, generally with only minor changes.

Note that papers and books which concentrate only models with two regimes generally use $p_{11}$ and $p_{22}$ as the two free parameters, and derive the probabilities of switching regimes from those. However, the parameterization that we are choosing generalizes easily to more than two regimes.

As before, we will have $f_{(i)} \left( {y_t |X_t ,\Theta } \right)$ giving the likelihoods at $t$ conditional on the regime. Let $p_{t - 1|t - 1} $ represent the probability of regime 1 at time $t-1$ given data through $t-1$. Applying the transitions, the probability of regime 1 at $t$ given data through $t-1$ is $p_{t|t - 1} = p_{11} p_{t - 1|t - 1} + p_{12} \left( {1 - p_{t - 1|t - 1} } \right)$. This is the prediction step. The log likelihood is now computed as in the simpler model, except with the sequential estimate of the probability (compare with \eqref{eq:breaks_mixtureLogL}):

\begin{equation} \log \left\{ {p_{t|t - 1} f_{\left( 1 \right)} \left( {y_t |X_t ,\Theta } \right) + \left( {1 - p_{t|t - 1} } \right)f_{\left( 2 \right)} \left( {y_t |X_t ,\Theta } \right)} \right\} \end{equation}

Next is the update step, which computes the probability given data through $t$:

\begin{equation} p_{t|t} = \frac{{p_{t|t - 1} f_{\left( 1 \right)} \left( {y_t |X_t ,\Theta } \right)}}{{p_{t|t - 1} f_{\left( 1 \right)} \left( {y_t |X_t ,\Theta } \right) + \left( {1 - p_{t|t - 1} } \right)f_{\left( 2 \right)} \left( {y_t |X_t ,\Theta } \right)}} \end{equation}

This filtering procedure for computing the log likelihood works because of the properties of conditional likelihoods. The one remaining problem is what to do with $t=1$. What is the meaning of $p_{0|0} $, which is needed to compute the likelihood for $t=1$? There are two common ways to handle this. The first is to compute the ergodic or stationary probability of the chain. As long as all the transition probabilities are non-zero, there is a unique probability that doesn’t change when you apply the transitions to it. That can be computed fairly easily. The alternative is to add the pre-sample probability to the parameter set. It turns out that using the ergodic probability is simplest if you use maximum likelihood and the parameterized probability is simplest if you use EM.

@MSSETUP functions

Computing a log likelihood element for a Markov switching model takes three steps: the prediction step, the calculation of the log likelihood, then the update step to get the probabilities for use in the next entry. And the formulas above are specific to the two regime model. If you move to three regimes, the number of free parameters in the transition matrix goes up to six, and with four, to twelve. To allow for more than two regimes, it makes sense to switch to matrix operations and break the calculations into manageable pieces. We now describe a set of functions to do this that are provided on the procedure file MSSETUP.SRC (and are pulled in with a call to @MSSETUP or one of the more specific setup procedures).

The representation for the transition matrix which simplifies the calculations most is an $\left( {n - 1} \right) \times n$ matrix, where ${\bf{P}}\left( {i,j} \right)$ is the probability of moving to regime $i$ from regime $j$. The missing ${\bf{P}}\left( {n,j} \right)$ is just one minus the sum of the other elements in column $j$. (The transition matrix is sometimes written at the transpose of this—with the target regime in the column and source regime in the row. That convention requires post-multiplying by the transition matrix, rather than premultiplying with the convention used here.)

In the following, we will use the VECTOR p* to represent the most “current” estimate of the probabilities of the regimes.

%MCSTATE(P,PSTAR) takes such a transition matrix, combined with p*, and returns the VECTOR of predicted probabilities.

%MSUPDATE(F,PSTAR,FPT) returns the VECTOR of updated regime probabilities given the vector of likelihoods (levels, not logs) of the regimes in F and the predicted probabilities of the regimes in the vector PSTAR. FPT is a REAL returned by the function which gives ${\bf{F}} \bullet {\bf{p}}^ * $. If F is a vector of likelihoods given the regimes, FPT will be the overall likelihood for this entry so the log of it will be the log likelihood element.

%MCERGODIC(P) takes the transition probability matrix (parameterized as described above) and returns as a VECTOR the ergodic (stationary) distributions of the regimes. In most applications, this will be the appropriate initialization for p*. To feed this into the PSTAR vector, use the option START=(PSTAR=%MCERGODIC(p)) on the MAXIMIZE which estimates the model.

Two other functions on the file are used for handling a parameterization of $P$ using logistic indexes.

%MSLOGISTICP(THETA) returns the result from transforming an $\left( {n - 1} \right) \times n$ matrix of logistic indexes into an $\left( {n - 1} \right) \times n$ matrix of transition probabilities.

%MSPLOGISTIC(P) is the inverse of %MSLOGISTICP—it takes the $\left( {n - 1} \right) \times n$ matrix of transition probabilities and returns the implied logistic indexes. That mapping, which takes the unbounded $\theta _{ij} $ to the bounded $p_{ij} $, is:

\begin{equation} p_{ij} = \frac{{\exp \left( {\theta _{ij} } \right)}}{{1 + \exp \left( {\theta _{1j} } \right) + ... + \exp \left( {\theta _{n - 1,j} } \right)}} \end{equation}

Using these functions, the basic setup for maximum likelihood estimation for a Markov switching model is:

compute n=2

dec rect p(n-1,n)

dec vect f(n) pstar(n)

nonlin p a01 a02 a11 a12 sigma

frml markov = f=SimpleRegimeF(t),$

pstar=%msupdate(f,%mcstate(p,pstar),fpt),log(fpt)

maximize(start=(pstar=%mcergodic(p))) markov start end

The final line in the definition of the MARKOV formula will look the same regardless of the structure of the model. It’s the definition of the “F” vector that changes from model to model.

While these three functions are often useful, the calculations for some models don’t easily fit into this framework. For instance, the model in HAMILTON.RPF has a large set of regimes with a very sparse (that is, mostly zero) transition matrix, so it does the same types of calculations but uses specialized methods to avoid a lot of multiplies by zero.

Regime Probability Smoothing

The sequence of $p_{t|t} $ gives the probability that the data are in regime 1 at time $t$ given information only through time $t$. The smoothed estimates of the probabilities are the computed using all the data. Computing these requires keeping the entire history of both the filtered $p_{t|t} $ and predicted $p_{t|{t-1}}$ probabilities, then doing an extra sweep through the data (in reverse) at the end. See Hamilton (1994) for technical details.

To do this, add the following to the set up:

dec series[vect] pt_t pt_t1 psmooth

gset pt_t = %zeros(nstates,1)

gset pt_t1 = %zeros(nstates,1)

gset psmooth = %zeros(nstates,1)

and adjust the MARKOV formula to

frml markov = f=SimpleRegimeF(t),$

pt_t1=%mcstate(p,pstar),$

pt_t=pstar=%msupdate(f,pt_t1,fpt),log(fpt)

and add

@%mssmooth p pt_t pt_t1 psmooth

after the model has been estimated. %MSSMOOTH is also on MSSETUP.SRC.

Models with Lags

The model which we’ve been using so far has an observation equation which depends upon the current regime, not lagged regimes—the only dynamic connection among regimes comes through the Markov switching process for them. The analysis gets more complicated if the observation equation also depends upon lagged regimes. There have been several proposals for Markov switching models for GDP growth (or, more generally for Markov switching VAR’s) in increasing order of complexity:

\begin{equation} x_t = \alpha \left( {S_t } \right) + \varphi _1 x_{t - 1} + \ldots + \varphi _p x_{t - p} + u_t \label{eq:breaks_interceptswitchvar} \end{equation}

\begin{equation} x_t = \alpha \left( {S_t } \right) + \varphi _1 (S_t )x_{t - 1} + \ldots + \varphi _p (S_t )x_{t - p} + u_t \label{eq:breaks_fullswitchvar} \end{equation}

\begin{equation} x_t - \mu \left( {S_t } \right) = \alpha + \varphi _1 \left( {x_{t - 1} - \mu (S_{t - 1} )} \right) + \ldots + \varphi _p \left( {x_{t - p} - \mu (S_{t - p} )} \right) + u_t \label{eq:breaks_hamiltonmodel} \end{equation}

\begin{equation} x_t - \mu \left( {S_t } \right) = \varphi _1 \left( {x_{t - 1} - \mu (S_{t - 1} )} \right) + \ldots + \varphi _p \left( {x_{t - p} - \mu (S_{t - p} )} \right) + (1 - L)u_t \label{eq:breaks_lammodel} \end{equation}

In \eqref{eq:breaks_interceptswitchvar} and \eqref{eq:breaks_fullswitchvar} the autoregression changes with the regime. In \eqref{eq:breaks_interceptswitchvar}, only the intercept changes, while in \eqref{eq:breaks_fullswitchvar}, it’s the entire equation. Either of these can be handled using the methods described so far. The StateF function just needs to compute the residual and likelihood in each regime.

\eqref{eq:breaks_hamiltonmodel} is the Hamilton switching model. With $n=2$, there are high mean growth and low mean growth periods with a regime-invariant autoregression linking them to the data. The measurement equation depends upon current and $p$ lagged values of the regimes. This is most easily handled by expanding the number of “regimes” to $n^{p+1}$ to cover all the combinations that could occur in \eqref{eq:breaks_hamiltonmodel}. The augmented regimes at time $t$ have the underlying regimes at $t,t - 1,...,t - p)$. The transition probabilities between most of the augmented regimes is zero, since we can only move from $(S_t ,S_{t - 1} ,...,S_{t - p} )$ to $(S_{t+1} ,S_t ,...,S_{t - p + 1} )$ if $S_t ,...,S_{t - p + 1} $ remain the same.

\eqref{eq:breaks_lammodel} is the model from Lam (1990). While apparently trivially different from \eqref{eq:breaks_hamiltonmodel} (with the $1-L$ in the error term), that makes the likelihood at $t$ dependent upon the regimes stretching all the way back to the start of the data set since $u$ is unobservable. While it’s possible to estimate \eqref{eq:breaks_lammodel} by maximum likelihood, it’s quite a complicated process. This is an ideal situation for Gibbs sampling, since the model only needs to be computed for the single current draw for the history of the regimes, rather than being a probability average across regimes.

@MSVARSETUP

The procedure @MSVARSetup sets up a Markov switching VAR (or autoregression if there is only one variable). It creates matrices for the state transition probabilities (P), the means or intercepts (MU) and the lag coefficients (PHI) and defines several functions and formulas needed for estimating the model, both using maximum likelihood and EM.

@MSVARSetup( options )

# list of dependent variables

Options

lags=# of VAR lags [1]

states=# of states [2]

switch=[m]/i/mh/ih/c/ch

In each of the SWITCH choices, M indicates a Hamilton-type model where the mean of each series is regime-dependent, I indicates the intercept is regime-dependent, and C means that the entire coefficient vector switches—those are mutually exclusive. H indicates variances (covariance matrices for multivariate systems) are regime-dependent.

It should be noted that great care is required to estimate a model with switching variances using maximum likelihood (or EM) as the likelihood function is unbounded. To see why, take the mixture model with both means and variances switching. The log likelihood for observation $t$ is

\begin{equation} \log \left\{ {pf_N \left( {x_t - \mu _1 ,\sigma _1^2 } \right) + \left( {1 - p} \right)f_N \left( {x_t - \mu _2 ,\sigma _2^2 } \right)} \right\} \end{equation}

where $f_N \left( {x,\sigma ^2 } \right)$ is the normal density at $x$ with variance $\sigma ^2$. If we take $\mu _1 = x_1 $ (we could peg it at any data value), the likelihood can be made arbitrarily high by making $\sigma _1 $ very close to 0. The other data points will, of course, have a zero density for the first term, but will have a non-zero value for the second, and so will have a finite log likelihood value. As a result, the likelihood function has many very high “spikes” around $\mu$ values equal to the data points.

This can be fixed by using a more complicated Bayesian estimation, since even a weak prior on the variances can exclude these very narrow regions. You can also add a REJECT option to prevent either variance from getting too small. In the case of a VAR, the REJECT test has to be a more complicated test for a nearly singular covariance matrix.

You also may find the results from allowing for switching variances to be disappointing. While you may hope that (for instance) SWITCH=MH will give you high and low mean regimes, but with different variances, the problem with latent regimes is that the data may favor high variance and low variance regimes and not really care much about the means—in fact, that tends to be the case. Markov models generally work best when the number of switching parameters is sharply constrained.

Example (Maximum Likelihood)

This sets up and estimates a one lag, two regime “Hamilton” (switching mean) VAR model by maximum likelihood. Note that you need to give an explicit estimation range. This uses the logistic parameterization for the likelihood, which requires also computing p=%mslogisticp(theta) during the START processing.

MU will be a VECTOR[VECTOR] with the outer vector for the regimes and inner for the variable. So MU(1)(5) is the mean in the first regime for the fifth variable (DCAN). PHI will be a VECT[RECT] with one matrix per lag. In this case (since there’s only one lag), there will just a single $6 \times 6$ matrix.

The %MSVARPROB(T) function does the whole set of calculations for computing the likelihood for the regimes, and doing the prediction and update steps on the probabilities. It returns the likelihood, so we just need to define a FRML to give its log.

@MSVARINITIAL does a standard set of guess values, aimed at making regime 1 the high mean and regime 2 the low mean regime. (The complete example is on the Krolzig MSVAR replication files as wbc_multivariate_pt1.rpf).

compute gstart=1962:1,gend=1991:4

@msvarsetup(lags=1,states=2)

# dusa djap dfrg duk dcan daus

nonlin(parmset=varparms) mu phi sigma

nonlin(parmset=msparms) theta

frml msvarf = log(%MSVARProb(t))

@msvarinitial gstart gend

maximize(parmset=varparms+msparms,$

start=%(p=%mslogisticp(theta),pstar=%MSVARInit()),$

method=bfgs,iters=400) msvarf gstart gend

Example (EM Estimation)

The M step of EM for the VAR coefficients (the means, lag coefficients and covariance matrix) is just a probability-weighted multivariate linear regression, with the probabilities being the smoothed estimates for each combination of regimes. The M step for the transition matrix requires smoothed estimates not just of the probabilities at $t$, but of the joint probabilities of all combinations of $\left\{ {S_t ,S_{t - 1} } \right\}$

in order to estimate the probabilities of moving from $S_{t-1}$ to $S_t$. This requires no extra calculations for the Hamilton mean switching model, since it already needs smoothed estimates of all combinations of regimes from $t$ to $t-p$. The MSVARSETUP file includes several support functions for EM estimation:

@MSVAREMGENERALSETUP sets up the work arrays needed and @MSVAREMSTEP does the combination of E and M steps. For the example above, this does 50 iterations of EM.

@msvarinitial gstart gend

@msvarEMgeneralsetup

do emits=1,50

@msvarEMstep gstart gend

?"Iteration" emits "Log Likelihood" %logl

end do emits

The likelihood functions used by ML and EM aren’t strictly comparable. As written, the ML likelihood function uses the ergodic initialization for the pre-sample probabilities. If you try that with EM, the (otherwise) relatively simple M step for the transition probabilities becomes quite complex. Instead, the natural way to handle the pre-sample probabilities with EM is to estimate them, which simply requires taking the probability smoother and running it backwards one extra period. It’s possible to add the pre-sample probabilities to the parameter set of ML, but that adds more parameters to an already often quite large set. It remains true that the best approach is generally to use EM up to a point, then switch to ML, but the final estimates won’t converge as quickly as they might in a model where the target likelihoods match.

@MSREGRESSION, @MSSYSREGRESSION, procedure groups

The @MSVARSETUP procedures are for the specific case of a VAR (or univariate AR) where the only variables allowed are the lagged dependent variables and the intercept (in some form). In many cases, however, that might be too restrictive. There are two other groups of procedures for more general models: @MSRegression is for univariate regressions while @MSSysRegression is for multivariate regressions (multiple equations with the same explanatory variables). The use of these is covered in detail as part of the Structural Breaks and Switching Models e-course.

Examples

In addition to the Krolzig MS-VAR's, examples offered for Markov Switching models include HAMILTON.RPF which does the original Hamilton model, and MSVARIANCES.RPF which does a Markov Switching model for variances. SWARCH.RPF, which does a Markov switching ARCH model is discussed in a separate section. The Kim and Nelson text starts out with a set of Markov Switching models (eventually working up to rather complicated Markov Switching State-Space Models).