
The term "spurious regression" was coined by Granger and Newbold(1974) to describe how apparently statistically "significant" results could occur when regressing one random walk on another independent random walk (so the true regression coefficient is zero). This had actually been observed almost 50 years earlier in Yule(1926) who had done regressions on random walks generated by flipping coins (in what may have been one of the first examples of a Monte Carlo analysis).

The results in Granger and Newbold show that if you run

\begin{equation} {y_t} = \alpha + \beta {x_t} + {u_t} \label{eq:spurious_reg} \end{equation}

where \(x\) and \(y\) are independently generated random walks, the t-statistic on \(\beta\) does not converge to zero as the sample size gets large, as it would if \(x\) and \(y\) were independent stationary processes. In fact, there is a non-trivial probability that you would conclude, even with a very large amount of data, that \(x\) and \(y\) were related, even though they aren't.
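This is easy to demonstrate by simulation. The following is a minimal sketch in RATS (the seed and the sample size of 500 are arbitrary choices): it generates two independent random walks by accumulating independent N(0,1) draws, then runs the static regression \eqref{eq:spurious_reg}.

* Two independent random walks, then the (spurious) static
* regression of one on the other
seed 53531
allocate 500
set ex = %ran(1.0)
set ey = %ran(1.0)
acc ex / x
acc ey / y
* The t-statistic on X will typically look highly "significant"
* even though the true coefficient is zero
linreg y
# constant x

In repeated runs, the t-statistic on X will routinely be far outside the conventional range, while the Durbin-Watson will be near zero.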

 

The important thing to note is that this regression, as it is run, is dynamically misspecified: under the true value \(\beta = 0\), the residual process is itself a random walk. The famous rule of thumb from Granger and Newbold is that if the \(R^2\) is bigger than the Durbin-Watson statistic (which will be near zero if the residuals follow a random walk), you may have a spurious regression.
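Both statistics are defined as reserved variables by LINREG, so the rule of thumb can be checked mechanically right after the regression above (a sketch, continuing the simulation):

* %RSQUARED and %DURBIN are defined by the preceding LINREG
disp "R^2 =" %rsquared "Durbin-Watson =" %durbin
if %rsquared > %durbin
   disp "R^2 exceeds the Durbin-Watson: possible spurious regression"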

 

It's important not to be confused by this result into thinking that you can't run a regression with \(I(1)\) variables on both sides. Nothing could be further from the truth. First, if the series are cointegrated, then the regression of \(y\) on \(x\) isn't spurious. In that case, not only is the estimate of \(\beta\) not spurious, it's "superconsistent", converging to the true value faster (as a function of sample size) than would be the case if \(y\) and \(x\) were stationary. However, you can't tell the difference between these two cases without digging deeper into the dynamics. If the series are cointegrated, the residuals, while most likely serially correlated, aren't a random walk; in the true spurious regression case, the Durbin-Watson can be 0.2 or even lower. The Engle-Granger test (@EGTEST) tests for cointegration by running a (potentially) spurious regression and testing the residuals for a unit root.
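For reference, a minimal invocation looks like the following (a sketch: the LAGS value, which sets the number of augmenting lags for the unit root test on the residuals, is an arbitrary choice here, and the procedure's documentation describes the full option list). The first series listed is the dependent variable in the cointegrating regression:

* Engle-Granger test: regress Y on X, then test the residuals
* for a unit root
@egtest(lags=4)
# y x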

 

Second, and perhaps more important, dynamic regressions (those with lags) don't generally suffer from this type of problem. For example, vector autoregressions and ARDL (autoregressive-distributed lag) models are generally fine even if the variables involved are \(I(1)\): the lags allow the model to "difference" out the \(I(1)\) effects. For instance, if instead of the original static regression, we run the related ARDL model:

\begin{equation} {y_t} = \rho {y_{t - 1}} + \alpha + {\beta _0}{x_t} + {\beta _1}{x_{t - 1}} + {u_t} \label{eq:spurious_ardl} \end{equation}

we would expect an estimate of \(\rho\) very close to 1, \(\alpha\) and the \(\beta\)'s close to zero, and residuals that are very nearly serially uncorrelated. The standard t-statistic for \(\rho = 1\) will have a non-standard distribution, and a joint test for both \(\beta\)'s being zero will similarly be non-standard (see Sims, Stock, and Watson, 1990), but the coefficients will be consistently estimated, which doesn't happen with the spurious regression. In other words, if you don't need to test for specific values, you can safely run the regression.
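In RATS, \eqref{eq:spurious_ardl} is a one-line LINREG. As a sketch (continuing the simulated random walk example), the joint test on the \(\beta\)'s can be done with EXCLUDE, keeping in mind that the printed p-value is based upon standard asymptotics, which, as just noted, don't apply here:

* ARDL(1,1): lagged dependent variable, current and lagged x
linreg y
# constant y{1} x{0 to 1}
* Joint test that both beta's are zero; the printed p-value
* assumes standard asymptotics, which don't apply here
exclude
# x{0 to 1}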

 

Of course, an ARDL or VAR is very different from a static regression, and the coefficients mean completely different things. The "textbook" description of how to estimate a price elasticity is to run a regression like

\begin{equation} \log {q_t} = \alpha + \eta \log {p_t} + {u_t} \label{eq:spurious_elasticity} \end{equation}

which is a static regression. If \(\log q\) and \(\log p\) are \(I(1)\), we run into the possibility that the regression will be spurious. So what can we do? The first thing is to see if we can figure out why \(\log q\) and \(\log p\) are \(I(1)\) and possibly adjust for that. If this is a time series, and \(\log q\) is increasing at least in part because of increasing population, that has nothing to do with price signals; replacing it with quantity per capita (or perhaps as a percentage of GDP) may render it stationary, and will certainly bring it closer to the true definition of an elasticity. Similarly, if \(\log p\) is \(I(1)\) because of overall inflation, deflating it by an overall price index may render it stationary and, again, make the regression produce a truer estimate of the price elasticity. In other words, \eqref{eq:spurious_elasticity} may be not just statistically spurious, but economically spurious as well. These adjustments are simple transformations, as sketched below.
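As a sketch (the series names Q, POP, P, and CPI are hypothetical stand-ins for quantity, population, price, and an overall price index):

* Hypothetical input series: Q (quantity), POP (population),
* P (price), CPI (overall price index)
set lqpc = log(q/pop)
set lrp  = log(p/cpi)
* Static elasticity regression using the adjusted series
linreg lqpc
# constant lrp

Even if we eliminate the \(I(1)\)-ness of the variables, it's still possible for the static regression \eqref{eq:spurious_elasticity} to not adequately capture the situation. If we instead estimate the dynamic regression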

\begin{equation} \log {q_t} = \alpha + \rho \log {q_{t - 1}} + \beta \log {p_t} + {u_t} \label{eq:spurious_dynamic} \end{equation}

then there really isn't a single elasticity measure. The initial or impact elasticity is \(\beta\), but if there's a permanent price change, the long-run (limit) elasticity is

\begin{equation} \beta /(1 - \rho ) \label{eq:spurious_longrun} \end{equation}

which may make more sense economically if it takes time to adjust to new prices. That won't work if \(\rho = 1\), however, which is why it's still important to rid the equation of unit roots that have nothing to do with the reaction to pricing.
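As a sketch using the hypothetical LQPC and LRP series from above, both elasticities can be pulled out of the coefficient vector %BETA that LINREG defines (the SUMMARIZE instruction can also be used to attach a delta-method standard error to the long-run ratio):

* Dynamic regression: log quantity on its own lag and log price
linreg lqpc
# constant lqpc{1} lrp
* Impact elasticity is the coefficient on LRP;
* long-run elasticity is beta/(1-rho)
compute impact  = %beta(3)
compute longrun = %beta(3)/(1.0-%beta(2))
disp "Impact elasticity =" impact "Long-run elasticity =" longrun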

 


Copyright © 2025 Thomas A. Doan