RATS 10.1

Analysis of Variance: the Instruction PSTATS

Does a single statistical model seem to unite the individuals (or time periods) in the data set, or not? The instruction PSTATS can be used to help answer that question. It performs an analysis of variance test for common means, across individuals, across time, or both. This can be applied to data series themselves, but is more typically applied to the residuals from a simple regression across the entire data set.

 

To do this, use PSTATS with the option TEST. The EFFECTS option allows you to choose a test for INDIVIDUAL effects, TIME effects, or BOTH. The following output, for instance, gives a marginal significance level of .3597 on the test for individual effects. We would conclude that there isn’t compelling evidence that the means differ across individuals.
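As a sketch, the test shown in the output below could be produced by a call along these lines (assuming the residuals have been saved in the series RESIDS):

pstats(test,effects=individual) resids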

 

Analysis of Variance for Series RESIDS
Source      Sum of Squares  Degrees   Mean Square   F-Statistic  Signif Level
INDIV       1.5912296947623      39  .0408007614042        1.080 0.35971320
ERROR       6.0418415088589     160  .0377615094304
TOTAL       7.6330712036212     199

 

PSTATS will also test a series for equality of variances. To do this, use the SPREAD option with the appropriate EFFECTS option. (You can’t choose EFFECTS=BOTH for this).
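For instance, a call like the following (again using the illustrative residual series RESIDS) tests for equality of variances across individuals:

pstats(spread,effects=individual) resids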

 

Linear Regressions 

When you have a linear regression model, there are a number of techniques which can exploit the structure of a panel data set. See, for instance, Baltagi (2008) for details. Because a panel data set has two or more observations per individual, it offers the possibility of reducing the estimation error that would occur if only a single cross section were available. However, it is necessary to make some assumptions which tie the model together across individuals.

Among the possibilities:

1. All coefficients are assumed to be constant across individuals (and time).

2. Some coefficients are assumed to be constant, while others differ.

3. The coefficients are assumed to differ, but are “similar,” so that large differences are implausible.

 

In addition, there can be assumptions about a link among the error terms:

1. The error terms can be assumed to be independent across individuals (or across time), but correlated within an individual’s record.

2. The error terms can be assumed to be correlated at a given point in time across individuals, but independent across time.

 

Even if an assumption is plausible for your data set, you need enough data to employ the technique. For instance, if you have only a few observations per individual, it will be difficult, if not impossible, to allow the coefficients to vary across individuals, since you don’t have enough data for any individual to pin down the coefficient estimates. Similarly, if you have many individuals, you may not be able to freely estimate a covariance matrix of error terms across individuals.

 

The most commonly employed techniques are the Fixed and Random effects estimators. Fixed effects allows the intercepts to vary while keeping the other coefficients fixed. Random effects is a related technique which works through assumptions on the error term. The Random Coefficients model (Swamy’s method), which allows all coefficients to vary in a controlled fashion, is shown in example SWAMY.RPF. We’ll describe the others more briefly here.

 

Regressions with Dummy Variables

If you want to allow coefficients other than the intercepts to vary (freely) across individuals, you can create interactions between individual dummies and each varying regressor. (If everything but the intercepts is fixed, you can use a fixed effects estimator.) If you have a very large number of individuals, this can become impractical, as the regression can be too big to handle.
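As a minimal sketch, suppose (hypothetically) that Y is regressed on X and we want the slope on X to differ for individual 3. The interaction can be built directly with %INDIV(T):

set d3  = %indiv(t)==3   ;* dummy for individual 3
set d3x = d3*x           ;* its interaction with X
linreg y
# constant x d3 d3x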

 

Part of the PANEL.RPF example shows how to do a regression with dummy variables.

 

Heterogeneous Regressions

The instruction SWEEP with the option GROUP=%INDIV(t) can handle a variety of complicated operations based upon linear regressions in which the coefficients are allowed to vary (freely) across individuals. Among these are the calculations of means of the coefficient vectors (mean group estimators) and the extraction of residuals from “nuisance” variables.

 

sweep(group=%indiv(t),var=hetero)
# invest
# constant firmvalue cstock

 

does a “mean group” regression of INVEST on CONSTANT, FIRMVALUE and CSTOCK, producing in %BETA the average of the coefficients across individuals, and in %XX the covariance matrix of the estimates, allowing the variances to differ among individuals.

 

sweep(group=%indiv(t),series=tvar)
# dc lpc{1} lndi dp
# dy ddp constant

 

regresses each of the series DC, LPC{1}, LNDI and DP on DY, DDP and CONSTANT, with a separate regression for each individual, producing the VECT[SERIES] TVAR with the residuals from those regressions.

 

 

Autoregressive Errors: Using AR1

There are some specialized techniques which combine autoregressive errors with some other panel data methods. These aren’t available in RATS. If you run an AR1 instruction on a panel data set, you have two ways of handling the estimation of \(\rho\):

1. You can assume that it is fixed across all cross-sections (the default treatment).

2. You can assume that it is different in each (use the option DIFFERING).

We don’t recommend the use of DIFFERING unless you have many time periods per cross-section—the error introduced into the GLS procedure by using a large number of rather poorly estimated serial correlation coefficients can be very large.

 

If you use DIFFERING, AR1 does not iterate to convergence and leaves the \(\rho\)’s out of the covariance matrix of coefficients.
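As a sketch, using the variables from the SWEEP example earlier, the two treatments look like this:

ar1 invest
# constant firmvalue cstock

ar1(differing) invest
# constant firmvalue cstock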

 

Robust Errors

With time series data, you can’t, in general, handle arbitrary and unknown patterns of serial correlation: the requirement for consistency is that the window width go to infinity more slowly than the number of data points, and to guarantee a positive definite covariance matrix, you need to use a lag window type (such as Newey-West) which further cuts the contribution of the longer lags. With panel data, however, you can rely on the “N” dimension to provide consistency. Assuming that “T” is relatively small, you can correct for arbitrary correlation patterns by using ROBUSTERRORS with the option LWINDOW=PANEL. This computes the following (positive semi-definite by construction) as the center term in the “sandwich” estimator:

\begin{equation} \sum\limits_i {\left( {\sum\limits_t {X_{it} u_{it} } } \right)^\prime \left( {\sum\limits_t {X_{it} u_{it} } } \right)} \end{equation}

For a similar calculation based upon grouping on some identifier other than the individual, use the CLUSTER option; that uses the same center term, but with “i” defined based upon the input clustering expression.
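As a sketch, again borrowing the variables from the SWEEP example (INDUSTRY is a hypothetical series holding a group code for each observation):

linreg(robusterrors,lwindow=panel) invest
# constant firmvalue cstock

linreg(robusterrors,cluster=industry(t)) invest
# constant firmvalue cstock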

 

Seemingly Unrelated Regressions

Seemingly unrelated regression techniques are, in general, applied to panel data sets, though usually to “small N, large T” data sets. As implemented in RATS using the instruction SUR, these require a well-formed panel of type “B.” You can, however, estimate using a type “A” panel data set, using the instruction PREGRESS (Panel Regress) with the option METHOD=SUR. This estimates the model

\begin{equation} y_{it} = X_{it} \beta + u_{it} \end{equation}

Note that the number of free parameters in the covariance matrix is \(N(N+1)/2\). If you have relatively short time series, you can quickly exhaust your ability to estimate an unconstrained covariance matrix if you include many individuals. When \(N\) is too large to make SUR feasible, an alternative is to estimate by least squares and then compute Panel-Corrected Standard Errors. That can be done using the procedure @REGPCSE.
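As a sketch with the same example variables, the SUR estimator would be requested with:

pregress(method=sur) invest
# constant firmvalue cstock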

 

First Difference Regressions

If the model is

\begin{equation} y_{it} = X_{it} \beta + u_{it}, \quad u_{it} = \varepsilon_i + \eta_{it} \end{equation}

one approach for dealing with the individual effects (\(\varepsilon_i\)) is to first difference the data within each individual, since

\begin{equation} u_{i,t} - u_{i,t-1} = \left( \varepsilon_i + \eta_{i,t} \right) - \left( \varepsilon_i + \eta_{i,t-1} \right) = \eta_{i,t} - \eta_{i,t-1} \end{equation}

While it’s possible to do this manually (with SET or DIFFERENCE), you can also run this using PREGRESS with the option METHOD=FD. Note that this will probably induce rather substantial serial correlation in the residuals, which can create statistical problems in some cases; this is discussed under Instrumental Variables.
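Again as a sketch with the same example variables:

pregress(method=fd) invest
# constant firmvalue cstock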

 

Instrumental Variables

Instrumental variables estimators are important in panel data work because, in many cases, correcting for individual effects will create correlation between the transformed errors and the transformed regressors. For instance, if the explanatory variables include lagged dependent variables, neither fixed effects nor first differencing will provide consistent estimates for “large N, small T.” (For first differencing, the bias doesn’t go away even with large T.)

 

To do instrumental variables with panel data, you use the same instructions as you would for time series data: INSTRUMENTS to set the instrument set, and LINREG, AR1 or another instruction with the INST option. The main difference isn’t the instructions used; it’s the process required to create the instrument set. In panel data sets, the instruments are often only weakly correlated with the regressors, so large sets of them are often used. See, for instance, the ARELLANO.RPF example.
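The following is a minimal sketch in the spirit of Anderson-Hsiao (series names hypothetical): first difference to remove the individual effect, then instrument the lagged difference with longer lags of the level. A serious application, as in ARELLANO.RPF, would use a much larger instrument set.

set dy = y - y{1}
set dx = x - x{1}
instruments constant dx y{2 to 3}
linreg(inst) dy
# constant dy{1} dx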

 

Panel Unit Roots and Cointegration

It’s extremely difficult with a single time series of limited length to tell the difference between the permanent response to a shock implied by a unit root and a response with a half-life of 20 quarters (dominant root roughly .97). As in other cases, panel data can offer an alternative way to bring more data to bear on a question. If we can’t get a longer time span from one country, what about doing joint inference on multiple countries?

 

There are many panel unit root testing procedures. This is probably not surprising given how many testing procedures there are for single time series. And with panel data, there is the added complication that there are different choices for what's homogeneous and what's heterogeneous. If (as is typical) the null is that all individuals have a unit root, is the alternative that none of them do? That some might and some might not? Are the short-term dynamics different from one individual to the next? And so on.

 

The procedures available are @LEVINLIN for the Levin-Lin-Chu (2002) test, @HTUNIT for the Harris-Tzavalis (1999) test, @IPSHIN for the Im-Pesaran-Shin (2003) test, @BREITUNG for the Breitung (2000) test and @HADRI for the Hadri (2000) test. These are covered in great detail as part of the Panel and Grouped Data course.
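For example (the series LRGDP and the option settings are purely illustrative; each procedure has its own option list, documented in the procedure file):

@levinlin(lags=4) lrgdp
@ipshin(lags=4) lrgdp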

 

With cointegration, you can test or estimate a model or both. Again, a major complication is deciding what is heterogeneous and what is homogeneous. For purposes of testing, allowing the cointegration vectors to vary across individuals is by far the simplest. The procedure @PANCOINT produces test statistics aggregated across individuals using several different underlying tests that would be applied to a single individual, such as Engle-Granger tests.

 

For estimation, it's also simpler to assume that the cointegrating vectors are heterogeneous. Two procedures are available for handling this, both based upon the work of Peter Pedroni. @PANELFM does Fully Modified least squares (individual by individual), while @PANELDOLS does dynamic OLS. One thing to note, particularly with @PANELDOLS, is that because the estimation is done individual by individual, it is very easy to run out of usable data points. (DOLS uses lags and leads of every right-hand-side endogenous variable, which can make for a very large regression very quickly.) Assuming a homogeneous cointegrating vector requires a more complicated approach, since that has to coexist with almost certainly heterogeneous short-run dynamics.

 

Panel VAR's

What someone means by "Panel VAR" depends upon the context. The seminal paper on panel VAR's is Holtz-Eakin, Newey and Rosen (1988), which has a (very) large N–small T data set, with lag coefficients fixed across individuals and only intercepts varying. In practice, that's unlikely to be the desired model if you have a small N–large T data set (more typical of VAR work). For that, neither fully heterogeneous regressions (where there is little reason for even thinking of the data as a "panel") nor fully homogeneous regressions (which can be estimated by just running a standard VAR on the panel-stacked data) are likely to be interesting. Instead, it probably makes more sense to use some type of shrinkage or mean group estimator. An example of that is provided as part of the Panel and Grouped Data course.

 

Panel GARCH

In reality, almost any use of multivariate GARCH is applied to a "panel" data set, in the sense that it is a small N–large T set of data on similar securities. However, those are analyzed as a multivariate time series, rather than as a panel, because there is typically relatively little that is considered homogeneous across the series. (Each has its own mean and often its own variance process.) Cermeno and Grier (2006) propose a number of models which apply individual homogeneity to various parameters in a multivariate GARCH model. However, it is (like other multivariate GARCH models) applied most easily to a set of separate time series rather than to a stacked panel data set.

 


Copyright © 2025 Thomas A. Doan