RATS 11.1

Heteroscedasticity Testing

Starting with the basic linear regression model

\begin{equation} {\bf{y}} = {\bf{X}}\beta + {\bf{u}},\quad E[{\bf{u}}|{\bf{X}}] = 0 \label{eq:hypos_basemodel} \end{equation}

the null hypothesis in heteroscedasticity testing is

\begin{equation} E[{\bf{uu'}}] = \sigma ^2 {\bf{I}} \label{eq:hypos_homoscedasticity} \end{equation}

There are many alternatives ranging from simple (different variances over different periods) to elaborate (ARCH: autoregressive conditional heteroscedasticity). Several of these are common enough to have standard procedures that are supplied with RATS. In particular, the procedure @RegWhiteTest, applied immediately after a regression, can be used to do the Breusch-Pagan or White tests described here.

 

Most tests have two forms: an F form and an LM form, which give similar, but not identical, significance levels.

HETEROTEST.RPF example

The example file HETEROTEST.RPF is based upon a hedonic price regression for housing price, adapted from Wooldridge (2009). The three explanatory variables are the lot size, the square footage, and the number of bedrooms. This works with a linear specification. (Wooldridge also looks at a semi-log model, which is probably more appropriate and doesn't show heteroscedasticity.) Because we keep re-estimating this (possibly over different samples), we define an EQUATION so we can use LINREG(EQUATION) to estimate:

 

equation linff price

# constant lotsize sqrft bdrms

*

linreg(equation=linff)

 

Breusch-Pagan Test

Breusch and Pagan (1979) describe a Lagrange Multiplier test against the very general alternative

\begin{equation} \sigma_t^2 = h\left( z_t' \alpha \right) \end{equation}

where \(z_t\) is some set of variables, such as regressors. Under LM test principles, the function \(h\) washes out of the test statistic, so the single test works simultaneously against all alternatives based upon \(z\). We show here a slight modification to Breusch and Pagan, due to Koenker (1981), which does not require an assumption of Normal residuals. This does the two forms described above using the full set of explanatory variables as \(z_t\):

 

set usq = %resids^2

linreg usq

# constant lotsize sqrft bdrms
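
* F form: joint exclusion test on the z variables in the auxiliary regression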

exclude(title="Breusch-Pagan Test for Heteroscedasticity")

# lotsize sqrft bdrms
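
* LM (Koenker) form: based on T times the R^2 of the auxiliary regression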

cdf(title="Breusch-Pagan Test, LM Form") chisqr %trsquared 3

 

The LM version can be done using the procedure @RegWhiteTest with the option TYPE=BP. You need to do this right after the regression that you want to test, so we re-estimate the model:

 

linreg(equation=linff)

@RegWhiteTest(type=bp)

 

Harvey’s Test

This is the same idea as Breusch and Pagan, but with \(\exp\) as the \(h\) function. The \(\exp\) function is a reasonable functional form in practice, because it produces positive values and, if the “z” variables are in log form, gives a power function for the variance.

 

This does Harvey’s test using only the log of lot size:

 

linreg(equation=linff)

set logusq   = log(%resids^2)

set llotsize = log(lotsize)

linreg logusq

# constant llotsize
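
* LM form: T times R^2, with one restriction (llotsize)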

cdf(title="Harvey Test") chisqr %trsquared 1

 

White’s Test

White (1980) describes a test which has power against alternatives that affect the consistency of the least squares covariance matrix. It is a variation of the Breusch-Pagan test, where \(z_t\) consists of the regressors, their squares, and their cross products. Because there can be duplication among the regressors and the products of regressors, you may need to drop some of those. However, if you do end up with collinearity among the regressors, RATS will simply zero out the redundant variables and reset the degrees of freedom of the regression. The calculation for the degrees of freedom in the CDF instruction below will give the correct test value if such an adjustment is made. Because of the complexity of the setup for this test, though, we would recommend that you use the procedure @RegWhiteTest.

 

linreg(equation=linff)

set usq        = %resids^2
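
* Squares and cross products of the regressors, to use as the z variables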

set lotsq      = lotsize^2

set sqrftsq    = sqrft^2

set bdrmssq    = bdrms^2

set lotxsqrft  = lotsize*sqrft

set lotxbdrms  = lotsize*bdrms

set sqftxbdrms = sqrft*bdrms

*

linreg usq

# constant lotsize sqrft bdrms $

   lotsq sqrftsq bdrmssq lotxsqrft lotxbdrms sqftxbdrms
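
* %NOBS-%NDF-1 is the number of auxiliary regressors other than the
* constant; it stays correct if RATS zeroes out redundant variables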

cdf(title="White Heteroscedasticity Test") chisqr $

  %trsquared %nobs-%ndf-1

*

* Same thing done with @RegWhiteTest

*

linreg(equation=linff)

@RegWhiteTest

 

Goldfeld-Quandt Test

The Goldfeld-Quandt test has as its alternative hypothesis that one (identifiable) segment of the sample has a higher variance than another. The test statistic is computed by running the regression over the two subsamples and testing the ratio of the estimated variances.

 

In a cross section data set, the segments are usually determined by the values of one of the variables. In some cases, you will have zero/non-zero variables identifying the segments, and can use SMPL options to select the subsamples directly. Sections 8.3.3 and 8.4.2 of Hill, Griffiths, and Lim (2008) look at the variance of wages for those living in metropolitan areas versus those in rural areas. The variable METRO is equal to 1 for city dwellers, and 0 for those in rural areas, so we can use SMPL options directly:

 

linreg(smpl=metro) wage

# constant educ exper
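
* Save the residual sum of squares and degrees of freedom for each subsample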

compute rssmetro=%rss,ndfmetro=%ndf

linreg(smpl=.not.metro) wage

# constant educ exper

compute rssrural=%rss,ndfrural=%ndf

 

compute gqstat=(rssmetro/ndfmetro)/(rssrural/ndfrural)

cdf(title="Goldfeld-Quandt Test") ftest gqstat ndfmetro ndfrural

 

In other cases, you will need to sort or rank the values of one of the regressors to identify the subsamples. We recommend using ORDER with the RANKS option for this. The following does that as part of HETEROTEST.RPF:

 

order(ranks=lotranks) lotsize
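
* lotranks(t) = rank of entry t by lot size (1 = smallest)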

linreg(equation=linff,smpl=lotranks<=36)

compute rss1=%rss,ndf1=%ndf

linreg(equation=linff,smpl=lotranks>=53)

compute rss2=%rss,ndf2=%ndf

cdf(title="Goldfeld-Quandt Test") ftest $

 (rss2/ndf2)/(rss1/ndf1) ndf2 ndf1

 

Note that if the alternative hypothesis is that the variances are simply different, not that a specific half is greater than the other, then either a very small or a very large F will lead to rejection of the hypothesis. For a 5% two-tailed F, reject if the significance level is either less than .025 or greater than .975.
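
Since CDF saves the significance level in the variable %SIGNIF, one way to get the two-tailed level is the following sketch (the variable name TWOTAIL is just illustrative):

compute twotail=2.0*%min(%signif,1.0-%signif)

display "Two-tailed G-Q significance level" twotail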

 

Also, if you suspect that the variance is related to a continuous variable, breaking the sample into two parts means that each subsample will have some observations close to the break value, which reduces the power of the test. In this situation, the usual advice is to have a third subsample in the middle which isn't included in the test. The example above excludes roughly 20% of the observations in the middle (ranks 37 through 52). Because ORDER gives tied entries an average rank, the exact number of elements in either of the subsets isn't controlled by the number you choose. For instance, if three entries are tied for ranks 35, 36, and 37, they will all be assigned rank 36, so there would actually be 37 data points in the first partition.

 

ARCH Test

ARCH (for AutoRegressive Conditional Heteroscedasticity) was proposed by Engle (1982) as a way to explain why large residuals tend to clump together. The model (for first order ARCH) is

\begin{equation} u_t \sim N\left( 0,\sigma^2 \left( 1 + \alpha \, u_{t - 1}^2 \right) \right) \end{equation}

The test is to regress the squared residual series on its lag(s). This should have an \(R^2\) of zero under the null hypothesis. This example uses a different data set, since testing for ARCH rarely makes sense in a cross-section data set like the one used above.

 

One error users sometimes make is applying a test for ARCH to the data rather than residuals. What goes into the test should be serially uncorrelated (as much as possible). In this example, the data just need to have the mean removed first:
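
* DIFF with the CENTER option subtracts the sample mean, putting the centered series into RESIDS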

 

diff(center) dlogdm / resids

set usq = resids^2

linreg usq

# constant usq{1}

cdf(title="Test for ARCH(1)") chisqr %trsquared 1

 

Testing for higher order ARCH just requires adding extra lags of USQ and increasing the degrees of freedom on the CDF instruction.
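
For instance, a sketch of the test for ARCH(4), following the same pattern:

linreg usq

# constant usq{1 to 4}

cdf(title="Test for ARCH(4)") chisqr %trsquared 4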

 

This can be done more simply using the @ARCHTEST procedure:

 

@archtest(lags=1,form=lm) resids

 

You can also do a single @ARCHTEST to get a sensitivity table with tests for different numbers of lags by using the SPAN option. This, for instance, does tests for ARCH for lags 1 through 6.

 

@archtest(lags=6,form=lm,span=1) resids

 

It’s important to remember that if you reject the null, you are not concluding that an ARCH (or more likely GARCH) model will fit the data, just that the type of clustering of large residuals that is consistent with ARCH or GARCH is present.

 


Copyright © 2026 Thomas A. Doan