Panel and Grouped Data

The following pages provide the basic information you need to use panel data sets in RATS. Topics covered include data set organization, reading data into RATS, and statistical techniques. For fuller coverage of panel data, see our Panel and Grouped Data course.

Organization of Panel Data Sets

Panel data refers to data sets which include data for several individuals (or firms, countries, etc.) over some number of time periods. The data set itself will usually have one of the following structures:

A. Each variable has a single long block of values which contains data for all individuals. This is typical for survey data. The data can be grouped by individual or by time. For example, the data in the column on the left are grouped by individual (all time periods for the first individual, followed by all time periods for the second individual, etc.), while the column on the right is grouped by time (time period 1 for all individuals, time period 2 for all individuals, etc.):

Grouped by Individual      Grouped by Time
Individual 1, 1997:1       Individual 1, 1997:1
Individual 1, 1998:1       Individual 2, 1997:1
Individual 1, 1999:1       Individual 1, 1998:1
Individual 2, 1997:1       Individual 2, 1998:1
Individual 2, 1998:1       Individual 1, 1999:1
Individual 2, 1999:1       Individual 2, 1999:1

We are considering here only situations where there are a fixed number of time periods per individual.

B. Each individual has a separate time series for each variable. This is most common when data are assembled from different sources.

C. Each variable has a single long block of values, with a separate series identifying which individual is represented by an observation.


The ideal arrangement for panel data for RATS is type “A” data grouped by individual. RATS stores a panel data series as an \(N \times T\) stream of data, where \(N\) is the number of individuals in the series, and \(T\) is the number of time periods. The first \(T\) entries of the series represent all the time periods for the first individual, the next \(T\) entries represent all the entries for individual two, and so on. Panel data series always have the same number of entries per individual. Missing values are used to pad ranges if the data aren’t aligned naturally.
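The blocked layout described above can be sketched outside RATS. The following Python fragment is an illustration only: the function name `stack_by_individual` and the `MISSING` marker are hypothetical, and padding short blocks at the end is an assumption made for the sketch, not a statement of how RATS aligns data.

```python
# Sketch only: MISSING stands in for a RATS missing value (NA).
MISSING = None


def stack_by_individual(blocks, periods):
    """Concatenate per-individual lists into one long series of
    len(blocks) * periods entries, padding short blocks with MISSING."""
    stacked = []
    for block in blocks:
        if len(block) > periods:
            raise ValueError("block longer than the number of periods")
        stacked.extend(block)
        # Pad so every individual occupies exactly `periods` entries.
        stacked.extend([MISSING] * (periods - len(block)))
    return stacked


# Two individuals, T = 3; the second has only two observations.
series = stack_by_individual([[1.0, 2.0, 3.0], [4.0, 5.0]], periods=3)
print(series)  # [1.0, 2.0, 3.0, 4.0, 5.0, None]
```

Because every block has the same length, the position of any individual's data can be computed from its number alone.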

RATS can also handle the type "C" data as "grouped" data. Many of the same operations can be applied to grouped data, but only if the model you want to estimate doesn't require lags—it's the blocking by individual with aligned time periods in the panel scheme which allows that to be handled properly. If you need the time sequencing to be correct and your data are organized differently, you will need to convert it. The instruction PFORM can be used for most such conversions.

CALENDAR and ALLOCATE with Panel Data

To work with panel data sets, you need to use the appropriate CALENDAR and ALLOCATE instructions. The syntax of CALENDAR for correctly formatted panel data is

calendar(panelobs=periods,other options) parameter (optional, describes time series structure)

The PANELOBS option gives the number of time periods per individual. The rest of the CALENDAR instruction is the same as for time series data: it describes the time series structure of the data within each cross-section unit. However, you don’t need to set this other information to use the panel data features of RATS.

The syntax for an ALLOCATE for panel data is

allocate individuals//lastperiod

where individuals is the total number of individuals in the panel set. Technically, lastperiod is the number of time periods for the last individual, though it invariably is just the “periods” from CALENDAR.

After you have things set up, you can use the functions %PANELSIZE() (number of individuals) and %PANELOBS() (number of time periods) if you want to make a program flexible enough to handle different panel sizes.

Examples

calendar(panelobs=20)

allocate 50//20

This is an undated panel data set, with 50 individuals, and 20 time periods (observations) per individual.

calendar(panelobs=48,q) 2006:1

allocate 20//2017:4

This is a panel data set with 20 individuals and 48 time periods per individual. Each cross-sectional unit has quarterly data, starting with the first quarter of 2006.

Referencing Series Entries

With panel series, you reference time period n of individual m as m//n. For example:

compute starting2 = income(2//1)

sets STARTING2 to entry 1 for individual 2 of INCOME, and

declare vector firstper(20)

ewise firstper(i)=panelseries(i//1960:1)

creates FIRSTPER as a VECTOR of the 1960:1 entries for the first 20 individuals. Note that m and n can be any integer value or an integer-valued expression. As shown above, n can also be a date if you specified a date scheme on your CALENDAR.
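Given the storage layout (consecutive blocks of equal length, one per individual), the m//n notation corresponds to simple index arithmetic. The Python sketch below shows that arithmetic; `panel_entry` is a hypothetical helper written for illustration, not a RATS function.

```python
def panel_entry(indiv, period, periods_per_indiv):
    """1-based position of time period `period` for individual `indiv`
    in a series stored as consecutive blocks of `periods_per_indiv`."""
    if not 1 <= period <= periods_per_indiv:
        raise ValueError("period outside the individual's block")
    return (indiv - 1) * periods_per_indiv + period


print(panel_entry(2, 1, 20))    # 21
print(panel_entry(10, 20, 20))  # 200
```

With 20 periods per individual, 2//1 is the 21st entry of the series and 10//20 is the 200th.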

Reading Panel Data From a File

You bring panel data into RATS using the same procedure as time series data.

1. Use CALENDAR(PANEL=...) and ALLOCATE instructions.

2. Open a data file with OPEN DATA.

3. Read in the data with a DATA instruction.

RATS format is the only one which really “understands” panel data. With your data in RATS format, you can select a reduced set of entries from within each cross-section. For instance, if you have data starting in 1975, but only want to work with the data from 1990 on, you can set a CALENDAR appropriately:

calendar(panel=84,q) 1990:1

allocate 13//2010:4

You can also “pad” out the cross-sectional units. If you have data only for 8 observations apiece, but use PANEL=10 on CALENDAR, DATA will insert missing values for entries 9 and 10 within each unit.

However, if you are reading the data from any other format, the organization of the data on the file must match the CALENDAR exactly. Each individual must have exactly the number of periods specified by the PANELOBS option. See “Forming a Panel Data Set” for instructions on getting your data into this format.

Special Functions

RATS has two functions which can be very useful when working with panel data.

• %PERIOD(t) returns the time period corresponding to entry t. This can be very useful for setting time period dummies (see below).

• %INDIV(t) returns the number of the individual corresponding to entry t.

For example, given a panel set with 2 individuals, and five periods per individual, entry 6 is the first period for the second individual, so %PERIOD(6) is 1 and %INDIV(6) is 2.

set ltrend = %period(t)

creates LTREND as a trend which repeats within each cross-sectional unit. If you are using a “dated” CALENDAR, you can use the date functions (%YEAR, %MONTH, %DAY, %WEEKDAY) to determine the year, month, day, or day of the week of a given entry.
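The arithmetic behind %PERIOD and %INDIV can be illustrated with a short Python sketch. The names `period_of` and `indiv_of` are hypothetical stand-ins that assume the blocked layout described earlier; they are not the RATS implementation.

```python
def period_of(entry, periods_per_indiv):
    """Time period (1-based) within its block for a 1-based entry number."""
    return (entry - 1) % periods_per_indiv + 1


def indiv_of(entry, periods_per_indiv):
    """Individual number (1-based) that owns a 1-based entry number."""
    return (entry - 1) // periods_per_indiv + 1


# Two individuals, five periods each: entry 6 starts the second block.
print(period_of(6, 5), indiv_of(6, 5))  # 1 2

# A trend that restarts within each cross-sectional unit.
ltrend = [period_of(t, 5) for t in range(1, 11)]
print(ltrend)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```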

SET(NOPANEL)

In SET and FRML, a reference to an entry in a panel data series is treated as missing if it’s in the zone for a different individual than the one for the current entry T. This is usually what you want, since it prevents lags of a series from crossing the boundary into another individual’s data range. This becomes a problem, however, when you want the transformation to cross boundaries. Perhaps you’re doing bootstrapping, and the population from which you’re drawing is the full set of observations. Or perhaps you’re trying to extract data from across individuals into a new series. Use the NOPANEL option on a SET instruction to turn off this handling. For instance, the following does a random draw from all 200 observations of the series RESIDS:

boot select 1//1 10//20

set(nopanel) udraw = resids(select(t))
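The idea of drawing from the pooled observations, ignoring individual boundaries, can be sketched in Python as follows. This assumes uniform draws with replacement over all entries; `boot_draw` and the placeholder residuals are hypothetical and only illustrate the cross-boundary lookup.

```python
import random


def boot_draw(series, rng):
    """Resample a series with replacement from all of its entries,
    without regard to which individual each entry belongs to."""
    n = len(series)
    select = [rng.randrange(n) for _ in range(n)]  # random entry numbers
    return [series[i] for i in select]


# 10 individuals x 20 periods = 200 pooled observations (placeholder values).
resids = [float(i) for i in range(200)]
udraw = boot_draw(resids, random.Random(12345))
print(len(udraw))  # 200
```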

Handling Lags

If you run a regression involving lags or leads, any observation which requires a lagged or led value not available within a cross-section is dropped from the regression. Similarly, if you use a lag or lead in a SET instruction, you will get a value of NA if the lag or lead goes outside the cross-section. For instance, in

set dx = x-x{1}

DX will have a missing value in the first entry in each cross-sectional unit.
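As a rough illustration of this behavior, the Python sketch below differences a blocked series and returns a missing value wherever the lag would reach into another individual's block. `panel_diff` is a hypothetical helper, with `None` standing in for NA.

```python
def panel_diff(x, periods_per_indiv):
    """First difference within each block; None where the lag is
    unavailable (first entry of a block, or a missing input value)."""
    out = []
    for pos, value in enumerate(x):
        if pos % periods_per_indiv == 0:
            out.append(None)  # lag would cross into the previous individual
            continue
        prev = x[pos - 1]
        out.append(None if value is None or prev is None else value - prev)
    return out


# Two individuals, three periods each.
print(panel_diff([1.0, 3.0, 6.0, 10.0, 11.0, 13.0], 3))
# [None, 2.0, 3.0, None, 1.0, 2.0]
```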

If you need a lag series to have a value of zero when it goes out of the individual’s range, you can do that with something like this:

dec vect[series] flags(n-2)

do i=n-1,2,-1

set flags(i-1) = %if(%period(t)<=i,0.0,lcrmrte{i})

end do i

This generates a set of lag series for lags 2 to n-1, each of which zeros out when the time period is less than or equal to the lag being created.

Creating Dummies

If you need separate dummies for each individual, you can use the PANEL instruction with the DUMMIES option. For instance

panel(dummies=dummies)

will create a VECT[SERIES] called DUMMIES where DUMMIES(1) is a dummy for individual 1, DUMMIES(2) for individual 2, etc. You can also use SET with the %INDIV function to do dummies like this. For instance

dec vector[series] dummyx(%panelsize())

do i=1,%panelsize()

set dummyx(i) = (%indiv(t)==i)*xreg

end do i

This creates the VECTOR[SERIES] called DUMMYX with dummied-out copies of XREG.
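To make the structure of these series concrete, here is a small Python sketch of individual dummies and their interaction with a regressor. The function names are hypothetical and the blocked layout is assumed.

```python
def individual_dummies(n_indiv, periods_per_indiv):
    """One 0/1 series per individual over the full blocked sample."""
    total = n_indiv * periods_per_indiv
    return [
        [1.0 if pos // periods_per_indiv == i else 0.0 for pos in range(total)]
        for i in range(n_indiv)
    ]


def interact(dummies, xreg):
    """Copies of xreg that are zero outside each individual's block."""
    return [[d * x for d, x in zip(dummy, xreg)] for dummy in dummies]


dummies = individual_dummies(2, 2)
print(dummies)  # [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
print(interact(dummies, [5.0, 6.0, 7.0, 8.0]))
# [[5.0, 6.0, 0.0, 0.0], [0.0, 0.0, 7.0, 8.0]]
```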

Selecting Subsamples

If you want to limit an estimation to a consecutive range of entries, you can simply specify the endpoints using the start and end parameters as you would with non-panel data. For example, to limit a regression to a range of observations for individual two, you would do something like this:

linreg y 2//2011:1 2//2016:12

# constant x1 x2

using panel-format date references.

However, this approach will not work for selecting a subset of time periods from each individual. For example, if you wanted to include the 2011:1 through 2016:12 observations from all individuals, you would not want to do:

linreg y 2011:1 2016:12

# constant x1 x2

RATS would just interpret these dates as referring to individual one, and would exclude all remaining individuals from the estimation.

Instead, you need to create a dummy variable with non-zero values in the entries you want to include, and then use this series with the SMPL option. The %PERIOD and %INDIV functions, and the various logical operators, are very handy for this. For example, to include data from 2011:1 through 2016:12 for all individuals:

set panelsmpl = %period(t)>=2011:1.and.%period(t)<=2016:12

linreg(smpl=panelsmpl) y

# constant x1 x2
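The difference between a contiguous range and a per-individual selection can be seen in a small Python sketch of the dummy series. `period_mask` is a hypothetical helper that works with period numbers rather than dates.

```python
def period_mask(n_indiv, periods_per_indiv, first, last):
    """1 for entries whose within-block period lies in [first, last],
    0 otherwise, repeated for every individual."""
    mask = []
    for pos in range(n_indiv * periods_per_indiv):
        period = pos % periods_per_indiv + 1
        mask.append(1 if first <= period <= last else 0)
    return mask


# Two individuals, four periods each; keep periods 2 and 3 for both.
print(period_mask(2, 4, 2, 3))  # [0, 1, 1, 0, 0, 1, 1, 0]
```

The selected entries are not contiguous in the stacked series, which is why a simple start and end pair cannot describe them.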

Forming a Panel Data Set: The Instruction PFORM

If your data are not already in the correct form for a RATS panel data set, you may be able to use the instruction PFORM to rearrange them. Our recommendation is that you run a program to transform the data and write them to a RATS format file. Use the RATS format file when you are actually analyzing the data.

PFORM can take several different forms depending upon the current organization of your dataset. Note that you use PFORM to create one series at a time. If you need to concatenate separate series to create a panel set, use

pform newseries

# list of individual series

for instance,

pform exrate

# australia canada france germany japan netherlands uk us

concatenates eight series into a single panel. The constructed series will have data for each individual running from the earliest valid data point across all the input series to the final one, again across all the input series. If you need to construct several such series, make sure you don't run into a problem with different patterns of missing values causing the constructed series to not be properly aligned. You can use the SMPL option to enforce a particular sample for each series. If the raw data series are actually defining a time period rather than an individual, you can use the INPUT=TIME option to have them constructed properly. For instance,

pform(input=time) logr

# logr70 logr71 logr72 logr73 logr74

creates LOGR as a series with five observations per individual.

If you have a dataset with a tag series for the individuals and another for the time period, use something like

pform(indiv=id,time=year) p_n

# n

which creates P_N from an input series N, where each individual has data running from the earliest observed value of YEAR to the last observed value.
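The rearrangement from tagged records to a blocked panel series can be sketched in Python. This is a conceptual illustration only: `pform_from_tags` is a hypothetical function, ordering individuals by sorted tag value is an assumption, and `None` stands in for a missing value.

```python
def pform_from_tags(ids, times, values):
    """Rearrange tagged records into blocks by individual, each covering
    the earliest through the latest observed time; gaps become None.
    Returns (series, number of individuals, periods per individual)."""
    indivs = sorted(set(ids))  # assumption: individuals ordered by tag
    t0, t1 = min(times), max(times)
    lookup = {(i, t): v for i, t, v in zip(ids, times, values)}
    series = [lookup.get((i, t)) for i in indivs for t in range(t0, t1 + 1)]
    return series, len(indivs), t1 - t0 + 1


ids = [2, 1, 1, 2, 1]
year = [1991, 1990, 1991, 1992, 1992]
n = [20.0, 10.0, 11.0, 22.0, 12.0]
p_n, n_indiv, periods = pform_from_tags(ids, year, n)
print(p_n)               # [10.0, 11.0, 12.0, None, 20.0, 22.0]
print(n_indiv, periods)  # 2 3
```

Individual 2 has no 1990 record, so its block starts with a missing value and stays aligned with individual 1.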

Finally, if you have an input series which needs to be repeated across individuals or across time, you can use PFORM with the options REPEAT and INPUT=TIME or INPUT=INDIVIDUAL. Before you do this, however, you need to set up a panel CALENDAR scheme, so PFORM knows what the target locations are. INPUT=TIME is used with the more common situation where the series is a time series which is the same for each individual:

pform(input=time,repeat) p_gdp

# gdp

Panel Data Transformations: the Instruction PANEL

In addition to individual and time dummies, it is often necessary to transform data by subtracting off individual or time period means. This is done with the instruction PANEL. For the input series, PANEL creates a linear combination of the current entry (ENTRY weight), the mean of the series for an individual (INDIV weight), the mean across individuals for a given time period (TIME weight), the sums in each direction (ISUM and TSUM) and observation counts (ICOUNT and TCOUNT). For instance,

panel(entry=1.0,indiv=-1.0) series / dseries

panel(time=1.0) series / timemeans

The first of these creates DSERIES as the deviations from individual means of SERIES. The second creates TIMEMEANS as a series of means for the different time periods. This is copied across the individuals, so that, for instance, the first time period in each individual’s block will be equal to the mean of the first time period across all individuals.
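The linear combination that these two examples rely on can be sketched in Python for the ENTRY, INDIV and TIME weights. `panel_combine` is a hypothetical function; it assumes a balanced panel with no missing values, and treating an omitted weight as zero is an assumption of the sketch.

```python
def panel_combine(x, n_indiv, periods, entry=0.0, indiv=0.0, time=0.0):
    """entry * x + indiv * (individual mean) + time * (time-period mean),
    for a blocked series with no missing values."""
    indiv_means = [
        sum(x[i * periods:(i + 1) * periods]) / periods for i in range(n_indiv)
    ]
    time_means = [
        sum(x[i * periods + t] for i in range(n_indiv)) / n_indiv
        for t in range(periods)
    ]
    return [
        entry * x[i * periods + t] + indiv * indiv_means[i] + time * time_means[t]
        for i in range(n_indiv)
        for t in range(periods)
    ]


series = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]  # two individuals, three periods
print(panel_combine(series, 2, 3, entry=1.0, indiv=-1.0))
# [-1.0, 0.0, 1.0, -2.0, 0.0, 2.0]
print(panel_combine(series, 2, 3, time=1.0))
# [3.0, 4.5, 6.0, 3.0, 4.5, 6.0]
```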

You can also input component variances for a GLS transformation and let PANEL do the work of transforming the data. The following, for instance, does a standard random effects GLS transformation of TV to G_TV, based upon individual effects with VINDIV as the variance of the individual component and VRANDOM as the variance of the purely random component:

panel(vrandom=vrandom,vindiv=vindiv) tv / g_tv

PANEL also has the option GLS, which can be used to change the form of the GLS transformation. For instance, adding GLS=BACKWARDS to the above will create G_TV using only "backwards" means of TV; that is, the transformed series at time period \(t\) will be constructed using only data from time periods 1 to \(t\). There are some statistical methods where forwards or backwards construction of the transformation is important.
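For orientation, the textbook form of the random effects transformation for a balanced panel subtracts a fraction θ of the individual mean from each entry, with θ = 1 − sqrt(vrandom / (vrandom + T·vindiv)). The Python sketch below implements that textbook formula only; whether PANEL matches it in every detail (scaling, unbalanced panels, the GLS option variants) is not stated here, and `re_gls_transform` is a hypothetical name.

```python
import math


def re_gls_transform(x, n_indiv, periods, vrandom, vindiv):
    """Textbook random-effects quasi-demeaning for a balanced panel:
    x - theta * (individual mean)."""
    theta = 1.0 - math.sqrt(vrandom / (vrandom + periods * vindiv))
    out = []
    for i in range(n_indiv):
        block = x[i * periods:(i + 1) * periods]
        mean = sum(block) / periods
        out.extend(v - theta * mean for v in block)
    return out


tv = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]  # two individuals, three periods
print(re_gls_transform(tv, 2, 3, vrandom=1.0, vindiv=1.0))
# theta = 0.5 here, giving [0.0, 1.0, 2.0, 1.5, 3.5, 5.5]
```

When vindiv is zero, θ is zero and the series is unchanged; as vindiv grows relative to vrandom, θ approaches one and the result approaches deviations from individual means.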

PANEL also has an option to compute a separate sample variance for the data for each individual or for each time period. The resulting series can be supplied to LINREG through its SPREAD option. It should only be used if you have enough data per individual to give a reasonably sharp estimate of the variance.

linreg logc

# constant logpf lf logq f2 f3 f4 f5 f6

panel(spreads=firmvar) %resids

linreg(spread=firmvar) logc

# constant logpf lf logq f2 f3 f4 f5 f6
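The series of per-individual variances used in this two-step procedure can be sketched in Python as below. `spread_by_individual` is a hypothetical function; centering on the block mean and dividing by the number of periods are assumptions of the sketch, since the exact formula PANEL uses is not given here.

```python
def spread_by_individual(resids, n_indiv, periods):
    """Variance of each individual's block, copied to every entry of
    that block (assumed: centered, divisor = number of periods)."""
    out = []
    for i in range(n_indiv):
        block = resids[i * periods:(i + 1) * periods]
        mean = sum(block) / periods
        var = sum((r - mean) ** 2 for r in block) / periods
        out.extend([var] * periods)
    return out


# Two firms, two periods each.
print(spread_by_individual([1.0, -1.0, 2.0, -2.0], 2, 2))
# [1.0, 1.0, 4.0, 4.0]
```

With only a handful of periods per individual these estimates are noisy, which is the reason for the caution above about having enough data per individual.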