Cautionary Note on Models with Switching Variances
Posted: Thu Jul 08, 2010 9:47 am
The MSVARSETUP procedure allows either the means or the intercepts to switch (SWITCH=M, SWITCH=I), and also allows the covariance matrices to switch (SWITCH=MH switches means and covariances, while SWITCH=IH switches intercepts and covariances). Although we allow the covariance matrices to switch, you should note (since it often seems to be ignored) that this type of model with switching variances has very bad properties when estimated by (unconstrained) maximum likelihood.
To demonstrate this with a much simpler model, suppose y(t)~N(mu,sigma(t)^2), where sigma(t) is sigma1 if t is in state 1 and sigma2 if t is in state 2, with Prob(t is in state 1)=p. Then the likelihood for observation t is
p fN(y(t)|mu,sigma1^2) + (1-p) fN(y(t)|mu,sigma2^2)
where fN is the univariate Normal density. If mu=y(t) for some observation t, this value approaches infinity as sigma1-->0. In a model without mixing, this wouldn't be a problem, because the likelihoods at the other data points would go to zero just as fast. In a mixing model, however, all that happens is that the first term at the other data points goes to zero; as long as p<>1, the second term remains comfortably non-zero. Thus, the log likelihood function has a large number of "spikes" in very small zones of mu and sigma1. Maximum likelihood will only give "reasonable" results if it is lucky enough to avoid the "bad" patches, or if something is done to (somewhat arbitrarily) prevent the variances from getting too small. But any ML estimation on a multi-modal surface is suspect, and this surface has literally hundreds of modes. (In a multivariate model, the same thing can happen when one of the covariance matrices merely approaches singularity, rather than having to go all the way to a zero matrix.)
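To see the effect numerically, here is a minimal sketch (ordinary Python with made-up data, not RATS code; mu, sigma1, sigma2 and p are just the symbols from the example above) that pins mu to one observation and shrinks sigma1:

# A minimal sketch of the likelihood spike described above.
# The data are made up; mu, sigma1, sigma2 and p are the symbols from the text.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=100)   # any data set will do

p = 0.5        # probability of state 1
sigma2 = 1.0   # the "other" state's standard deviation stays fixed
mu = y[0]      # set the common mean exactly equal to one observation

def loglik(sigma1):
    # log of  p*fN(y(t)|mu,sigma1^2) + (1-p)*fN(y(t)|mu,sigma2^2), summed over t
    dens = p * norm.pdf(y, mu, sigma1) + (1 - p) * norm.pdf(y, mu, sigma2)
    return np.sum(np.log(dens))

for s in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    print(f"sigma1 = {s:8.6f}   log L = {loglik(s):10.3f}")

# The log likelihood keeps climbing as sigma1 --> 0: the spike at y[0] grows
# like -log(sigma1), while every other observation keeps its (1-p) term.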
Bayesian methods don't have a problem with this, because a prior on the variances smooths the spikes out of the posterior: the spikes occupy such tiny regions of the parameter space that they carry negligible posterior probability.
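Continuing the same sketch (same made-up data; the inverse gamma is just one conventional choice of prior, not something this post prescribes), adding a log prior on sigma1^2 shows the divergence turning into a penalty:

# Same toy setup as above, now with an inverse-gamma prior on sigma1^2.
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=100)
p, sigma2, mu = 0.5, 1.0, y[0]
a, b = 2.0, 1.0   # hypothetical inverse-gamma shape and scale hyperparameters

def logpost(sigma1):
    dens = p * norm.pdf(y, mu, sigma1) + (1 - p) * norm.pdf(y, mu, sigma2)
    # log likelihood plus log prior density for the state-1 variance
    return np.sum(np.log(dens)) + invgamma.logpdf(sigma1 ** 2, a, scale=b)

for s in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    print(f"sigma1 = {s:8.6f}   log posterior = {logpost(s):14.3f}")

# The prior's -b/sigma1^2 penalty swamps the -log(sigma1) growth of the spike,
# so the log posterior now falls sharply near sigma1 = 0 instead of diverging.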