Censored and Truncated Samples
Consider the basic regression model:
\begin{equation} y_i = X_i \beta + u_i \label{eq:censored_basicreg} \end{equation}
Suppose you wish to analyze this model using a cross-sectional sample. If the sample is chosen randomly from all available individuals, you can just use standard regression techniques. Similarly, there are no special problems if you select your sample based upon values of the exogenous variables.
In this section, we examine situations in which the sample is limited in some manner based upon the values of the dependent variable \(y\). The two simplest types of samples with limited dependent variables are truncated and censored samples.
In a truncated sample, an observation is left out of the observable sample if the value of \(y\) does not meet some criterion. For example, suppose you want to use payroll data to study the number of hours worked. You will have a truncated sample because your study will exclude people who work zero hours and are thus not on a payroll.
A censored sample (“tobit” model) has some observations for which we do not observe a true value of the dependent variable. For instance, in a study of unemployment duration, we will not see the true duration for an individual still unemployed at the end of the survey period.
More generally, if the value of a variable which is determined simultaneously with the dependent variable influences whether an observation is in the sample, then the sample suffers from selection bias.
We can see the statistical problem produced by such samples if we add to \eqref{eq:censored_basicreg} that the only observations we can see are those where \(y_i > 0\). The mere fact that an observation is in the sample tells us something about its residual: if \(X_i \beta\) is small (less than 0), observation \(i\) can only be in the sample if \(u_i\) is large enough to make \(y_i\) positive. In the observed sample, \(u_i\) and \(X_i \beta\) are therefore negatively correlated, and least squares estimates of \(\beta\) will be biased towards zero.
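If, for instance, \(u_i\) is normal with variance \(\sigma^2\), the size of this effect can be written explicitly (the same calculation reappears later in this section):
\begin{equation} E\left( u_i \mid y_i > 0 \right) = \frac{\sigma \,\phi \left( X_i \beta /\sigma \right)}{\Phi \left( X_i \beta /\sigma \right)} > 0 \end{equation}
where \(\phi\) and \(\Phi\) are the standard normal density and distribution function. The conditional mean of the error is positive, and it is larger the smaller \(X_i \beta \) is.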
Estimation
For the simplest forms of these models, you can use the instruction LDV (limited dependent variables). Some others can be handled with MAXIMIZE, while still others have a likelihood too complex to attack directly with maximum likelihood but can be estimated consistently by two-step methods.
Maximum Likelihood
Start with the normal linear model:
\begin{equation} y_i = X_i \beta + u_i ,\quad u_i \sim N\left( 0,\sigma ^2 \right)\ i.i.d. \end{equation}
If \(y_i\) is truncated below at \(TR\) (that is, only observations for which \(y_i \ge TR\) are in the sample), the (log) likelihood element for entry \(i\) is
\begin{equation} K - \frac{1}{2}\log \sigma ^2 - \frac{1}{2\sigma ^2 }\left( y_i - X_i \beta \right)^2 - \log \Phi \left( \frac{X_i \beta - TR}{\sigma } \right) \end{equation}
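This follows because the density of \(y_i\), conditional on being in the sample, is the normal density rescaled by the probability of clearing the limit:
\begin{equation} f\left( y_i \mid y_i \ge TR \right) = \frac{\tfrac{1}{\sigma }\,\phi \left( \left( y_i - X_i \beta \right)/\sigma \right)}{\Phi \left( \left( X_i \beta - TR \right)/\sigma \right)} \end{equation}
Taking logs (and collecting constants into \(K\)) gives the expression above.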
It turns out that a slight change to the parameterization, due to Olsen (1978), makes this a better behaved function: instead of \(\{ \beta ,\sigma \} \), change to \(\{ \gamma ,h\} \equiv \{ \beta /\sigma ,1/\sigma \} \). The log likelihood is then
\begin{equation} K + \log h - \frac{1}{2}\left( h y_i - X_i \gamma \right)^2 - \log \Phi \left( X_i \gamma - h\,TR \right) \end{equation}
This is what LDV does internally, so if you use the TRACE option, it will show coefficient estimates for that parameter set. The model is switched back to the standard parameterization (with a recomputed covariance matrix) once estimation is finished.
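The mapping back to the original parameters is simply
\begin{equation} \hat \beta = \hat \gamma /\hat h ,\qquad \hat \sigma = 1/\hat h \end{equation}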
If \(y\) is censored below at \(TR\) (we don’t observe a true value of \(y\) if \(y < TR\)), then the log likelihood elements are (again with Olsen’s parameterization)
\begin{equation} \left\{ \begin{array}{ll} K + \log h - \frac{1}{2}\left( h y_i - X_i \gamma \right)^2 & {\text{for observations which are not censored}} \\ \log \Phi \left( h\,TR - X_i \gamma \right) & {\text{for those that are}} \end{array} \right. \end{equation}
This is often called the Tobit model. To use this with LDV, the censored values of \(y\) should be equal to their limit.
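If the censored entries are instead coded as missing in the raw data, set them to the limit first. A minimal sketch for censoring below at zero (the series names Y and YWORK here are hypothetical):
set ywork = %if(%valid(y),y,0.0)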
Using LDV
LDV allows you to estimate a model with either censoring or truncation, limited above, below, or at both ends. You select which form of limit the dependent variable has with the TRUNCATE or CENSOR option. The choices for each are LOWER, UPPER, or BOTH; the default is NEITHER.
To set the limit values, use the option UPPER for the upper limit (if there is one) and LOWER for the lower limit. The limit can be a single value for all entries, or can vary across individuals. It does, however, have to be computable prior to estimation; that is, you can’t have the limit depend upon estimated parameters. For instance, the following estimates a model with censoring at both ends, with a lower limit of 0 and an upper limit of 4. (In the application from which this is taken, from Greene's 6th edition, the dependent variable is forcibly censored at 4 because the values above 4 were categories rather than observable values.)
set ytop = %min(y,4)
ldv(censor=both,lower=0.0,upper=4.0) ytop
# constant z2 z3 z5 z7 z8
and, from the same application, this estimates the model truncated at zero, restricting it to the sample with non-zero values of the series AFFAIR.
ldv(truncate=lower,lower=0.0,smpl=affair) y
# constant z2 z3 z5 z7 z8
If some entries are censored while others aren’t, you can indicate this by providing a series for UPPER or LOWER, with the entries set to %NA (missing value) for observations that aren’t limited. Suppose, for instance, that you have a series LIMIT in the data set which gives the upper limit for an entry when it is non-zero, and indicates no limit when it is zero. You could estimate this with a sequence like:
set upper = %if(limit==0,%na,limit)
ldv(censor=upper,upper=upper) ...
Because the log likelihood is well-behaved (when reparameterized as shown), and the second derivatives are rather simple to compute, LDV always estimates by Newton-Raphson. You can use the standard set of options for computing robust covariance matrices, but keep in mind that the estimates themselves are unlikely to be consistent if the assumption of normality is wrong.
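For instance, to redo the censored example from above with a robust covariance matrix (this sketch assumes the standard ROBUSTERRORS option, per the statement above that the usual options apply):
ldv(censor=both,lower=0.0,upper=4.0,robusterrors) ytop
# constant z2 z3 z5 z7 z8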
As with DDV, you can compute the “generalized residuals” with LDV using the GRESIDS option. Note, however, that these are computed using the reparameterized model. The only difference between the generalized residuals under the two parameterizations is a scale factor, which is usually of no consequence.
LDV also can do interval estimation, where the “dependent variable” really consists of a pair of values which bracket an otherwise unobservable amount. Use the option INTERVAL, and provide the bracketing values with the UPPER and LOWER options. Use an %NA for either one to show that it is unlimited; for instance, if an entry in the UPPER series is %NA, it means that the value was unbounded above. Note that you still need a dependent variable series, though its only real purpose is to show which entries should be used in estimation: any entry with a missing value in the dependent variable is skipped. The following, for instance, is from Verbeek (2008), example 7.2.4. The data were generated by starting with an initial bid (BID1). If an individual indicated an unwillingness to pay that much, a lower number was offered which also could be accepted or rejected. If the initial bid was accepted, a higher value was tried. The series UPPER and LOWER are created to show the upper and lower bounds created by this sequence.
set upper = %if(nn,bidl,%if(ny,bid1,%if(yn,bidh,%na)))
set lower = %if(nn,%na,%if(ny,bidl,%if(yn,bid1,bidh)))
ldv(upper=upper,lower=lower,interval,gresid=gr) bid1
# constant
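The log likelihood element for an interval observation takes the standard interval-regression form (presumably what LDV maximizes internally with INTERVAL); with bounds \(L_i\) (LOWER) and \(U_i\) (UPPER), it is
\begin{equation} \log \left( \Phi \left( \frac{U_i - X_i \beta }{\sigma } \right) - \Phi \left( \frac{L_i - X_i \beta }{\sigma } \right) \right) \end{equation}
with the corresponding \(\Phi\) term replaced by 1 (or 0) when the upper (or lower) bound is %NA.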
Using MAXIMIZE
It’s useful to see how to estimate a limited dependent variable model using MAXIMIZE, in case you have an application which doesn’t fit one of the forms handled by LDV.
As was the case with the logit and probit models, this is simplified by creating a FRML which computes \(X_i \beta \).
To estimate a regression truncated below, do something like the following (where the series LOWER is set equal to the truncation values):
nonlin(parmset=sparms) sigmasq
linreg y
# constant x1 x2
frml(lastreg,vector=b,parmset=bparms) zfrml
compute sigmasq=%seesq
frml truncate = (z=zfrml(t)) ,$
%logdensity(sigmasq,y-z)-%logcdf(sigmasq,z-lower)
maximize(method=bfgs,parmset=bparms+sparms) truncate
This estimates the model using the standard parameterization, starting from the least squares estimates.
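If the data set also contains entries at or below the truncation point (as with the AFFAIR example earlier), restrict the estimation range; a sketch using the standard SMPL option (SMPL=AFFAIR here just mirrors that earlier example):
maximize(method=bfgs,parmset=bparms+sparms,smpl=affair) truncate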
The FRML for the log likelihood function for censored observations requires use of %IF to select the appropriate branch.
frml tobit = (z=zfrml(t)) , %if(y==lower,$
%logcdf(sigmasq,lower-z), $
%logdensity(sigmasq,y-z))
For example, the following estimates the model from the Greene example using MAXIMIZE, censoring below (only) at 0:
linreg y
# constant z2 z3 z5 z7 z8
nonlin(parmset=sparms) sigmasq
frml(lastreg,vector=b,parmset=bparms) zfrml
compute sigmasq=%seesq
frml tobit = (z=zfrml(t)) , %if(y==0,$
%logcdf(sigmasq,-z), $
%logdensity(sigmasq,y-z))
maximize(method=bfgs,parmset=bparms+sparms) tobit
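The coefficient estimates should match (up to convergence tolerance) those from LDV applied to the same censored model:
ldv(censor=lower,lower=0.0) y
# constant z2 z3 z5 z7 z8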
Sample Selection and Heckit Estimators
The model with censoring below at 0 is often called the Tobit model. In the original context, the dependent variable was expenditure on a car; if the individual didn’t buy a car, this was recorded as zero. One potential problem with the tobit model is that it combines two decisions (whether to buy a car and, if so, how much to spend) into a single regression equation. An alternative is to have a separate model (a probit, presumably) for the first decision and, given a “yes” answer on that decision, a regression to explain the amount spent.
The first step probit is straightforward. The second step, however, is likely subject to selection bias. If the underlying model for the probit is to choose to purchase if and only if
\begin{equation} X_i \gamma + v_i > 0 \end{equation}
and the model for the expenditure is
\begin{equation} y_i = X_i \beta + u_i \label{eq:censored_selectivitybase} \end{equation}
(the use of the same explanatory variables is just for convenience), then, if there is a correlation between \(u\) and \(v\), the estimates of \(\beta\) in \eqref{eq:censored_selectivitybase} will be biased. Heckman’s (1976) idea is to compute the bias in \(u\):
\begin{equation} E\left( u_i \mid X_i ,\, X_i \gamma + v_i > 0 \right) \end{equation}
and adjust the second regression to take that into account. If \(u\) and \(v\) are assumed to be joint normal and i.i.d. across individuals, then \(u\) can be written:
\begin{equation} u_i = \lambda v_i + \xi _i \end{equation}
where \(v_i\) and \(\xi_i\) are independent normals. So
\begin{equation} E\left( u_i \mid X_i ,\, X_i \gamma + v_i > 0 \right) = E\left( \lambda v_i + \xi _i \mid X_i ,\, X_i \gamma + v_i > 0 \right) = E\left( \lambda v_i \mid X_i ,\, X_i \gamma + v_i > 0 \right) \end{equation}
With \(v\) a standard normal (the usual normalization for a probit), it can be shown that
\begin{equation} E\left( v \mid v > -z \right) = \frac{\phi \left( z \right)}{\Phi \left( z \right)} \end{equation}
where \(\phi\) is the standard normal density and \(\Phi\) is its cumulative distribution function. \(\phi /\Phi \) is the reciprocal of Mills’ ratio (the “inverse Mills’ ratio”), which you can obtain using the MILLS option of PRJ after estimating the first-step probit model. Since \(\lambda\) is unknown, adding the inverse Mills’ ratio to the regressor set and allowing its coefficient to be estimated will give (under suitable conditions) a consistent estimator of \(\beta\). This estimator is sometimes known as Tobit II or Heckit.
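In terms of the regression actually run at the second step (over the selected subsample only), this is
\begin{equation} y_i = X_i \beta + \lambda \,\frac{\phi \left( X_i \hat \gamma \right)}{\Phi \left( X_i \hat \gamma \right)} + \text{error} \end{equation}
where \(\hat \gamma \) is the first-stage probit estimate and the coefficient on the added inverse Mills’ ratio regressor estimates \(\lambda\).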
The options on PRJ which are useful for doing these types of two-step estimators are MILLS (inverse Mills’ ratio) and DMILLS (derivative of the inverse Mills’ ratio). The above was all based upon a first stage probit, which conveniently is based upon a standard normal with bottom truncation at zero. The MILLS option, however, can do a broader range of calculations. In particular, it can handle a non-unit variance, and truncation either at the top or bottom. In general, the expected value for a truncated \(N(0,\sigma ^2 )\) is given by
\begin{equation} E\left( v \mid v > -z \right) = \frac{\sigma \,\phi \left( z/\sigma \right)}{\Phi \left( z/\sigma \right)} \end{equation}
When we’re looking to compute the expected value of a residual from the projection in a first stage estimator whose distribution is truncated at the value \(T_i\), the normalized \(z_i\) takes the following values:
\begin{equation} z_i = \left\{ \begin{array}{ll} {\text{Bottom truncation:}} & \left( X_i \hat \gamma - T_i \right)/\sigma \\ {\text{Top truncation:}} & \left( T_i - X_i \hat \gamma \right)/\sigma \end{array} \right. \end{equation}
The MILLS option computes \(\sigma \,\phi \left( z_i \right)/\Phi \left( z_i \right)\) for these values. PRJ has three options for describing the truncation and normalization:
scale=the value of \(\sigma\) [1]
upper=SERIES or FRML of upper truncation points [unused]
lower=SERIES or FRML of lower truncation points [series of zeros]
The truncation points can differ among observations. For instance, cutoffs may depend upon some demographic characteristics. However, truncation must either be top truncation for all observations or bottom truncation for all. Note that, by default, the calculation is precisely the one needed for the Tobit II estimator with its first stage probit.
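For instance, following the first-stage estimation, a sketch for a more general case, assuming a previously computed estimate SIGMA of \(\sigma\) and a hypothetical series CUTOFF of entry-specific lower truncation points:
prj(mills=lambda,scale=sigma,lower=cutoff)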
As the first step in the “heckit” procedure, estimate a probit for the INLF variable (in labor force), using the explanatory variables from the hours equation:
ddv(dist=probit) inlf
# constant nwifeinc educ exper expersq age kidslt6 kidsge6
Compute the inverse Mills’ ratios from this:
prj(dist=probit,mills=lambda2)
Then run OLS, including LAMBDA2 as an additional explanatory variable:
linreg(smpl=inlf) lwage
# educ exper expersq constant lambda2
Example
TOBIT.RPF provides an example of both maximum likelihood estimation of a tobit model and two-step (Tobit II) estimation.