RATS 11.1

Penalized least squares is a collection of methods that try to improve the out-of-sample predictions of a linear model by shrinking the estimator towards some fixed vector (almost always zero). A very simple example which shows why this works is:

 

Assume \(\hat x\) is an unbiased estimator for the unknown \(x\). Let \(\sigma ^2  = E\left( {\hat x - x} \right)^2 \). Consider the family of alternative estimators \(\hat x(\lambda ) = \left( {1 - \lambda } \right)\hat x\) for different values of \(\lambda\). This "shrinks" the estimator towards zero when \(\lambda  > 0\). The mean squared error of using \(\hat x(\lambda )\) to estimate \(x\) is

\begin{equation} E\left( {\left( {1 - \lambda } \right)\hat x - x} \right)^2 = E\left( {\left( {1 - \lambda } \right)\left( {\hat x - x} \right) - \lambda x} \right)^2 = \left( {1 - \lambda } \right)^2 \sigma ^2 + \lambda ^2 x^2 \label{eq:penalized_simple} \end{equation} 

where the cross term drops out because \(x\) is non-random and \(E\left( {\hat x - x} \right) = 0\) since \(\hat x\) is unbiased.

 

The derivative of \eqref{eq:penalized_simple} with respect to \(\lambda\) is

\begin{equation} - 2(1 - \lambda )\sigma ^2 + 2\lambda x^2 \end{equation}

Evaluated at \(\lambda=0\) this is the negative value \(- 2\sigma ^2 \), so, at least for small values of \(\lambda\), the mean squared error is smaller for the shrinkage estimator than for the original unbiased one. This is because the two terms in \eqref{eq:penalized_simple} are the variance of the estimator and the squared bias. As \(\lambda\) moves away from zero, the variance goes down, while the squared bias goes up, but the increase in the squared bias is slow at first.
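Setting the derivative to zero gives the mean squared error minimizing amount of shrinkage

\begin{equation} \lambda ^* = \frac{{\sigma ^2 }}{{\sigma ^2 + x^2 }} \end{equation}

at which the mean squared error is \(\left( {1 - \lambda ^* } \right)\sigma ^2\), strictly smaller than the \(\sigma ^2\) achieved by the unbiased estimator.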

Now, this is fine in theory, but in practice we do not observe \(x\), so we can't tell how far we can push the shrinkage before the squared bias starts to dominate. In fact, the main issue with any of these methods is coming up with a reasonable way to choose the type and the amount of shrinkage.

There are two principal contenders for the type of shrinkage:

Ridge regression

LASSO (Least Absolute Shrinkage and Selection Operator)

We have another section on Ridge/Mixed Estimation where Ridge regression is a special case of quasi-Bayesian estimation (with a zero mean prior). That can be applied to a linear regression of relatively modest size. However, Ridge and LASSO are also applied to linear models where the number of explanatory variables is quite high and conceivably could be greater than the number of data points.

While there are several ways to write the optimization problem that these solve, the most straightforward is that we have the linear model

\begin{equation} y_t = \alpha + X_t \beta + u_t \label{eq:penalty_linear} \end{equation}

Ridge regression minimizes

\begin{equation} \sum {\left( {y_t - \alpha - X_t \beta } \right)^2 } + \lambda \beta '\beta \label{eq:penalty_ridgeoptimum} \end{equation}

while LASSO minimizes

\begin{equation} \sum {\left( {y_t - \alpha - X_t \beta } \right)^2 } + \lambda \sum\limits_i {\left| {\beta _i } \right|} \label{eq:penalty_LASSOoptimum} \end{equation}

These are also called L2 penalization and L1 penalization, as they penalize least squares with a multiple of the (squared) L2 norm (sum of squares) and of the L1 norm (sum of absolute values), respectively, of the coefficient vector. The minimizer depends upon the chosen value of \(\lambda\): as \(\lambda  \to 0\), it converges to the least squares estimator, and as \(\lambda  \to \infty \), it approaches a vector of zeros. Note that LASSO, by the nature of its optimization problem, will often force many coefficients to be exactly zero even at fairly modest values of \(\lambda\) (that's the "Selection Operator" part of the LASSO name). By contrast, the coefficients in ridge regression will generally all remain non-zero and shrink smoothly towards zero as \(\lambda\) increases.
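A one-regressor special case (no intercept, with the single regressor scaled so that \(\sum {x_t^2 } = 1\) and \(b\) the least squares estimate) makes the difference concrete: the two minimizers are

\begin{equation} \hat \beta _{{\rm{ridge}}} = \frac{b}{{1 + \lambda }},\quad \hat \beta _{{\rm{LASSO}}} = {\mathop{\rm sign}} (b)\max \left( {\left| b \right| - \lambda /2,\,0} \right) \end{equation}

Ridge scales the estimate towards zero but never reaches it exactly, while LASSO subtracts a fixed amount and sets the coefficient to exactly zero once \(\left| b \right| \le \lambda /2\).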

Choosing the Tuning Parameter

In the original paper on ridge regression, Hoerl and Kennard proposed choosing \(\lambda\) graphically, by trying different values and noting where the coefficient estimates "stabilized". The problem with this is that the decision would be judgmental and could depend upon how the graph was constructed. Instead, the recommended selection method is cross-validation: use part of the sample (the training sample) to fit the model and another part (the testing sample) to evaluate the mean squared error. If we don't separate the two roles, the in-sample mean squared error is minimized by least squares itself, so no shrinkage would ever be chosen. To avoid making the estimator too dependent upon a particular split of the sample, the common strategy is to aggregate the mean squared error across multiple splits. Note that this can take a while if the set of free parameters is quite large.
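In symbols, if the data are split into training/testing pairs indexed by \(k\), with \(\hat \alpha _k ,\hat \beta _k (\lambda )\) estimated from training sample \(k\), one common form of the aggregated criterion (the form effectively used in the 10-fold example later in this section) is

\begin{equation} CV(\lambda ) = \frac{1}{{\sum\nolimits_k {T_k } }}\sum\limits_k {\sum\limits_{t \in {\rm{test}}_k } {\left( {y_t - \hat \alpha _k - X_t \hat \beta _k (\lambda )} \right)^2 } } \end{equation}

where \(T_k\) is the size of test sample \(k\); \(\lambda\) is then chosen to minimize \(CV(\lambda )\).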

The PLS instruction (Penalized Least Squares) solves \eqref{eq:penalty_ridgeoptimum} or \eqref{eq:penalty_LASSOoptimum} for a specific value of \(\lambda\), so choosing \(\lambda\) will require loops over trial values and over choices of samples.

Choosing Scale Factors

In addition to the question of choosing \(\lambda\), there is also a question about the appropriate way to scale the penalty among the different regressors. In \eqref{eq:penalty_ridgeoptimum} and \eqref{eq:penalty_LASSOoptimum}, the penalty function treats all coefficients the same. Of course, that means that the penalty will be affected more by a coefficient which has a naturally large scale than by a coefficient which has a naturally small one. RATS offers two ways to handle this:

1. You can input a VECTOR of known parameter weights. The values should counteract the natural scale of the parameters, so they will be small for coefficients expected to be large and large for coefficients expected to be small. You can also give zero weight to parameters (such as the intercept) for which there is no clear reason to apply shrinkage. (In practice, the analysis is usually applied to data with the means extracted, so the intercept isn't directly estimated.)

2. You can do the analysis using standardized (mean zero, variance one) regressors. This means using the correlation matrix of the variables rather than their cross-product matrix.

 

The latter is clearly more straightforward, as it doesn't require any judgment about particular values, and the former may be infeasible for a very large model with a wide variety of regressors. However, if you are aggregating across different training samples, standardization would use a different set of scales in each sample, which might change how the model fits from one split to the next.
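To see the connection between the two approaches, suppose (as the description of the first option suggests) that the weights multiply the coefficients inside the penalty. Standardizing regressor \(j\) by its standard deviation \(s_j\) rescales its coefficient to \(\gamma _j = s_j \beta _j\), so an unweighted penalty on the standardized coefficients is the same as a weighted penalty on the original ones with weights equal to the regressor standard deviations:

\begin{equation} \lambda \sum\limits_j {\gamma _j^2 } = \lambda \sum\limits_j {\left( {s_j \beta _j } \right)^2 } ,\quad \lambda \sum\limits_j {\left| {\gamma _j } \right|} = \lambda \sum\limits_j {s_j \left| {\beta _j } \right|} \end{equation}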

Using a Cross Product Matrix

Choosing a tuning parameter may require solving the penalized least squares problem hundreds of times, depending upon the method used for doing the search. The computational time can be reduced by computing the cross product (or correlation) matrix of the regressors and dependent variable just once—that provides sufficient statistics for solving either optimization problem for arbitrary values of \(\lambda\). You compute this with the CMOMENT instruction. There are three forms of this which can be used:

CMOM(CORR) computes a correlation matrix (thus the resulting matrix has 1's on the diagonal). Note that this also standardizes the dependent variable, which changes the scale of the coefficients and the range of possible values of \(\lambda\).

CMOM(CENTER) subtracts the means from all the variables but does not rescale any of them.

CMOM with neither the CORR nor the CENTER option just computes a raw cross product matrix.

 

If you use one of these, and use the CMOM option on PLS, the instruction knows what type of transformation was done to the variables and so is able to reverse that out to produce an estimator of the original model \eqref{eq:penalty_linear}.
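To see why the cross-product matrix is sufficient, write the regressors as rows of a matrix \(X\) (with the intercept handled by the centering). The sum of squares part of either objective can then be expanded as

\begin{equation} \sum\limits_t {\left( {y_t - X_t \beta } \right)^2 } = y'y - 2\beta 'X'y + \beta 'X'X\beta \end{equation}

so once the cross products \(y'y\), \(X'y\) and \(X'X\) have been computed, the objective can be evaluated for any \(\beta\) and \(\lambda\) without another pass through the data.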

Picking the Range for the Tuning Parameter

Solving the LASSO problem for a given \(\lambda\) is a non-linear optimization problem. The ridge regression optimization can (in principle) be solved by matrix calculations, but if the number of X variables is quite large (it can run into the thousands), that may be quite time-consuming, since it requires solving a system of equations of that size. Instead, PLS uses cyclic coordinate descent for both types of penalties, which solves the first order condition for each individual coefficient in a loop. Each iteration is cheap, but it can take many iterations for the model to converge, which is why the default for the ITERATIONS option is 1000, rather than the more typical 100 for variational methods like BFGS.
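For reference, the matrix solution for the ridge problem (with the data de-meaned so the intercept drops out) is

\begin{equation} \hat \beta (\lambda ) = \left( {X'X + \lambda I} \right)^{ - 1} X'y \end{equation}

A standard form of the coordinate descent update (shown here only as an illustration; PLS may implement the details differently) cycles through the coefficients, replacing each \(\beta _j\) while holding the others fixed:

\begin{equation} \beta _j \leftarrow \frac{{S\left( {\sum\limits_t {x_{jt} r_t^{(j)} } ,\lambda /2} \right)}}{{\sum\limits_t {x_{jt}^2 } }}\;{\rm{(L1)}},\quad \beta _j \leftarrow \frac{{\sum\limits_t {x_{jt} r_t^{(j)} } }}{{\sum\limits_t {x_{jt}^2 } + \lambda }}\;{\rm{(L2)}} \end{equation}

where \(r_t^{(j)} = y_t - \sum\nolimits_{k \ne j} {x_{kt} \beta _k }\) is the residual leaving out regressor \(j\) and \(S(z,\gamma ) = {\mathop{\rm sign}} (z)\max \left( {\left| z \right| - \gamma ,0} \right)\) is the soft-threshold function. The L1 update is what forces coefficients to exactly zero.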

 

Because the minimand for a given \(\lambda\) is calculated only approximately, standard methods for searching over values of \(\lambda\) won't work, since in practice the computed objective isn't a continuous function of \(\lambda\). Instead, you need to search over a range of values, which raises the question of how to choose that range. The mean squared error is often quite flat over a fairly broad range, so a relatively coarse grid covering several orders of magnitude usually makes sense.

 

If we look at \eqref{eq:penalty_ridgeoptimum} or \eqref{eq:penalty_LASSOoptimum}, we see that each has two parts: the sum of squares and the penalty. Note that the sum of squares depends upon the number of data points, and the penalty does not. We would thus expect that \(\lambda\) should go up with the size of the data set—if we used twice as much data with a given model, we would expect a tuning parameter roughly twice as large. Also, note that the L2 penalty uses the squared norm of \(\beta\) while the L1 penalty uses the L1 norm itself, so the \(\lambda\) for the two wouldn't be expected to be comparable.

 

You can approximate the sum of squared residuals if you have at least a rough idea of what the \(R^2\) will be, as that would be \(T(1-R^2) {\mathop{\rm var}} (y)\). As for the penalty function, that will depend upon how the X's are scaled. If the X's are standardized to unit variance, there is reason to believe that \(\beta '\beta\) will be at least something on the order of \(R^2 {\rm var}(y)\). The procedure @PLSGrid uses these approximations to generate a grid that, in most cases, should cover the optimal value.
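As a rough back-of-the-envelope calculation (not necessarily the exact formula that @PLSGrid uses), a \(\lambda\) at which the ridge penalty would be comparable in size to the sum of squared residuals satisfies

\begin{equation} \lambda \,R^2 {\mathop{\rm var}} (y) \approx T\left( {1 - R^2 } \right){\mathop{\rm var}} (y)\quad \Rightarrow \quad \lambda \approx T\frac{{\left( {1 - R^2 } \right)}}{{R^2 }} \end{equation}

which shows the proportionality to \(T\) described above; a grid several orders of magnitude wide around a value like this should bracket the useful range.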

Evaluating the Model

For a given value of \(\lambda\), we need to estimate the model using the training sample and then use the mean squared error over the test sample as our objective function. This is the mean squared error of using the model to predict the \(y_t\) in \eqref{eq:penalty_linear}. However the variables are scaled as part of the estimation process, PLS will always substitute back to express the equation in that form. The instruction PRJ can be used to obtain the fitted values over the test sample, and the instruction SSTATS can be used to compute the mean squared error over that sample. For instance, LASSO.RPF uses the following (inside a loop over test values of \(\lambda\)) to evaluate and save the mean squared error (StartTest to EndTest is the test sample range):

 

pls(penalty=l1,cmom,lambda=lambda,noprint) longrate

# shortrate{0 to 12}

prj fitted StartTest EndTest

sstats(mean) StartTest EndTest (longrate-fitted)^2>>testmse(%doforpass)

 

Choosing Testing vs Training Samples

LASSO.RPF is an example applied to time series data. With that, an obvious way to split the data is to hold back the end of the sample as the test sample and use the earlier part of the data as the training sample.

 

The Stock and Watson textbook has examples of LASSO and Ridge Regression in Chapter 14. These are applied to a "big data" set with well over 1000 explanatory variables, using cross section data (school districts in California). With cross section data, there is no obvious way to split up the sample, so this uses what is known as 10-fold cross validation: dividing the data randomly into 10 groups of observations, then holding out each group in turn as the testing sample while using the other 90% of the data as the training sample. The overall MSE is obtained by summing across those 10 folds. The simplest way to do this with RATS is to use the BOOT instruction to get random entry numbers, then use the SHUFFLE option on the instructions doing the calculations.

 

Because this depends upon random numbers, the results will be slightly different each time the program is run, but the overall results are quite consistent. This is the common setup:

 

set group = %clock(t,10)

boot(noreplace) shuffle

 

The GROUP series maps entries 1,11,21,... to the value 1; entries 2,12,22,... to the value 2; and so on. BOOT(NOREPLACE) fills SHUFFLE with a random permutation of the entry numbers, so GROUP(SHUFFLE(T)) assigns entry T to one of the 10 groups at random (for example, if SHUFFLE(5)=23, then entry 5 goes into group 3).

 

The pseudo-code for cross-validation is:

 

do leftout=1,10

   Fit model using data where the shuffled sample is not equal to LEFTOUT

   Evaluate MSE for the data where the shuffled sample is equal to LEFTOUT

end do leftout

 

Note that this loop needs to be run for each test value of \(\lambda\), and we need an overall value of the MSE for each one. The following produces a log-linear grid that roughly reproduces the range covered in the Ridge Regression example from the book:

 

dec vect testlambdas(21)

ewise testlambdas(i)=10^(.125*(i+11))

dec vect testmse(%size(testlambdas))

ewise testmse(i)=0.0
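Written out, this is a grid of 21 points, equally spaced in logs with a ratio of \(10^{0.125} \approx 1.33\) between successive values:

\begin{equation} \lambda _i = 10^{0.125\left( {i + 11} \right)} ,\;i = 1, \ldots ,21\quad \Rightarrow \quad \lambda _1 = 10^{1.5} \approx 31.6,\;\lambda _{21} = 10^4 = 10000 \end{equation}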

 

(The LASSO example has a different range since LASSO uses the L1 norm for the penalty and not the squared L2 norm of the coefficients).

 

Inside the DO LEFTOUT loop, this computes the cross-product matrix over the training sample (all entries for which the shuffled entry number is not in group LEFTOUT):

 

   cmom(center,shuffle=shuffle,smpl=group<>leftout)

   # x cubes interact testscore

   *

   * This pulls out the sigma's for the explanatory variables

   *

   dec vect scale(%ncmom-1)

   ewise scale(i)=sqrt(%cmom(i,i)/%nobs)
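Because CMOM(CENTER) subtracts the means, each diagonal element %CMOM(i,i) is a centered sum of squares, so (assuming %NOBS holds the number of observations used in the CMOM) each element of SCALE is the sample standard deviation of the corresponding explanatory variable:

\begin{equation} {\rm{scale}}_i = \sqrt {\frac{1}{T}\sum\limits_t {\left( {x_{it} - \bar x_i } \right)^2 } } \end{equation}

The dimension is %NCMOM-1 because the dependent variable (the last entry in the CMOM list) doesn't need a scale.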

 

and this estimates the model for each of the tuning values in TESTLAMBDAS:

 

 dofor [real] lambda = testlambdas

      pls(noprint,penalize=l2,lambda=lambda,cmom,scales=scale,iters=10000) testscore

      # x cubes interact

      *

      * This computes the fitted values

      *

      prj fitted

      sstats(smpl=(group==leftout),shuffle=shuffle,mean) / (testscore-fitted)^2>>mse

      ?leftout *.##### lambda mse * %iters %cvcrit

      *

      * This adds the sample-sized-weighted MSE (i.e. the sum of squared

      * errors) to the TESTMSE value for this lambda

      *

      compute testmse(%doforpass)+=mse*%nobs

   end dofor lambda

 

(This would be somewhat simpler if CMOM(CORR) were used rather than CMOM(CENTER) with the SCALES option on PLS; however, that would require a different range of tuning parameters than in the text.)

 

Because the number of data points isn't an exact multiple of 10, the number of observations in the testing samples isn't constant; hence TESTMSE adds up the sums of squared errors (the MSE times the number of test observations). Outside the loop, this converts the sums of squared errors to the square root of the mean squared prediction error (MSPE) and graphs the result:

 

ewise testmse(i)=sqrt(testmse(i)/1966.0)

scatter(x=testlambdas,y=testmse,style=lines,hlog=10,$

  header="Figure 14.2 Square Root of MSPE for Ridge Regression",$

  vlabel="Square Root of MSPE",hlabel="Lambda (Log Scale)")
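In formula form, with \(T = 1966\) presumably the total number of observations across all ten folds, the quantity being plotted against each \(\lambda\) is

\begin{equation} \sqrt {\frac{1}{T}\sum\limits_{k = 1}^{10} {\sum\limits_{t \in {\rm{fold}}\,k} {\left( {y_t - \hat y_t^{( - k)} (\lambda )} \right)^2 } } } \end{equation}

where \(\hat y_t^{( - k)} (\lambda )\) is the prediction for entry \(t\) from the model estimated with fold \(k\) left out.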

 

 


Copyright © 2026 Thomas A. Doan