
By default, non-linear optimization assumes that the free parameters are allowed to take any values. Some functions, however, have implicit bounds on the parameters. Variance parameters, for instance, are often forced to be positive by the presence of a log(v) term in the function. You don't need to do anything special to enforce this: the function value will be NA if v goes non-positive, so RATS will be forced to take a smaller step back into positive territory.
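As a minimal sketch of this (the parameter names MU and V, the series Y, and the use of a normal log likelihood with its constant term dropped are all just illustrative assumptions):

*  Y is assumed to be a data series already in memory
nonlin mu v
compute mu = 0.0, v = 1.0
frml logl = -0.5*(log(v) + (y-mu)^2/v)
maximize logl

If MAXIMIZE tries a parameter vector with V<=0, LOGL evaluates to NA and RATS backs off to a shorter step.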

 

However, there are situations in which the parameters are constrained, not by the nature of the function, but by the nature of the problem. If a parameter represents a physical amount of something, a negative value may be nonsensical but an unconstrained optimization might well give us a negative estimate.

 

If you have a single parameter subject to this type of imposed bound, constrained optimization is simple: estimate the model unconstrained. If the estimate satisfies the bound, you're done. Otherwise, re-estimate with the bound imposed.
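In RATS terms, the two-step approach looks like this sketch (B1 and B2 are hypothetical parameters, and LOGL is assumed to be a FRML already defined in terms of them):

nonlin b1 b2
maximize logl
*  if the unconstrained estimate of B1 violates the bound, re-estimate with it imposed
nonlin b1 b2 b1>=0.0
maximize logl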

 

However, if you have several constraints, it becomes much more complicated to work through the possible combinations. Instead, you can let RATS do the work for you.

Types of Constraints

There are three basic types of constraints that you can impose on your non-linear parameters. These are all imposed using the instruction NONLIN; a sketch of each appears after the list. Note that you must define the parameters before they can appear in a constraint expression; see the examples later in this section.
 

1. Substitution constraints. These give one parameter as a function of others, for instance, B3=B1*B2. This type of constraint is handled directly during the estimation process, so RATS doesn't need any special techniques.

2. Equality constraints. These are similar to substitution constraints, except that the equation is not explicitly solved for one of the variables. An example is B1*B2==B3*B4. (Note that you must use the == operator.) However, if you can solve out for one variable, do it. For instance, here, unless you have a strong possibility that B2 will be forced to zero, using B1=B3*B4/B2 will be more efficient.

3. Inequality constraints. Examples are B1>=0 and B1+B2+B3<=1.0. These cannot be strict inequalities. For instance, you can’t restrict a parameter to be positive: it must be constrained to be non-negative.
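As a sketch (using the hypothetical parameter names from the list), each type is included directly in the NONLIN parameter list. Each NONLIN below defines a complete parameter set on its own:

*  parameters must exist before they are used in a constraint expression
compute b1 = 0.1, b2 = 0.1, b3 = 0.1, b4 = 0.1
*  substitution constraint: B3 is computed from B1 and B2
nonlin b1 b2 b3=b1*b2
*  equality constraint (note the == operator)
nonlin b1 b2 b3 b4 b1*b2==b3*b4
*  inequality constraints
nonlin b1 b2 b3 b1>=0.0 b1+b2+b3<=1.0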

 

Algorithm

On each iteration, RATS first determines which of the inequality constraints are to be considered “active.” This includes all constraints which are violated, and any others which have non-zero Lagrange multipliers from the previous iteration. Let the active constraints (including any equality constraints) be represented as \(c(\beta) = 0\).


The Lagrangean for maximizing \(f(\beta)\) subject to these is


\({\cal L} = f(\beta) - \mu' c(\beta)\)


where \(\mu\) is the vector of Lagrange multipliers. RATS maintains an estimate of


\({\bf{G}} =  - \left( {\frac{{\partial ^2 {\cal L}}}{{\partial \beta \,\partial \beta '}}} \right)^{ - 1} \)

 

It computes


\({\bf{g}} = \frac{{\partial f}}{{\partial \beta }}\)


and


\({\bf{N}} = \frac{\partial c}{\partial \beta}\)


Using these, a new set of Lagrange multipliers is computed using the formula:


\(\mu  = \left( {{\bf{N'GN}}} \right)^{ - 1} \left( {{\bf{N'Gg}} - c} \right)\)


These are then used to compute the direction vector


\({\bf{d}} = {\bf{G}}\left( {{\bf{N}}\mu  + {\bf{g}}} \right)\)

 

This direction vector would, in a perfect world, take us to a zero gradient point for the Lagrangean.

 

With the direction chosen, the question then is how far to move in that direction. RATS searches over the step size \(\lambda\) using the following penalty function

 

\(f\left( \beta + \lambda {\bf{d}} \right) - \frac{1}{r}\left\| c\left( \beta + \lambda {\bf{d}} \right) \right\|^{2}\)

 

The scale factor \(r\) changes from iteration to iteration, becoming smaller as the estimates come closer to meeting the constraints. With a new set of parameters in hand, RATS then updates the estimate of \(\mathbf{G}\) using the BFGS formula and moves on to the next iteration. Any constrained estimation done with a hill-climbing method will use this variant of BFGS, even if you selected the BHHH or Gauss–Newton methods.

 

Constrained optimization is usually quite a bit slower than a similar problem with no constraints. This is not due to the extra calculations done on each iteration; rather, it simply takes more iterations, often many more. Even if the set of active constraints stays the same throughout the estimation process, the Lagrangean itself changes as the Lagrange multipliers change in the early going, so we’re trying to hit a moving target. Once the Lagrange multiplier estimates settle down, the estimates will start converging more smoothly.


When setting up your constraints, be aware of the sensitivity to scale. If, for instance, you were to put in a constraint such as 1000000*B1>=0, a fairly modest negative value of B1 would cause a large value in this constraint’s component of \(c\). This would likely so overwhelm the other constraints that all the early iterations would be devoted to forcing this single constraint to hold.
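For instance (with hypothetical parameters), the poorly scaled constraint and a better-scaled equivalent are:

*  poorly scaled: a tiny negative B1 produces a huge constraint violation
nonlin b1 b2 1000000*b1>=0.0
*  same admissible region, but scaled comparably with the other constraints
nonlin b1 b2 b1>=0.0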


 

The REJECT Option


Most RATS instructions that do non-linear optimization include a REJECT option. With this, you provide a function or formula that evaluates to “true” (non-zero) for any situation where you want the overall function to return an NA. An example is the option


reject=(abs(g1+g2)>=1.0)


which (in the situation to which this applies) eliminates from consideration any unstable combinations of the G1 and G2 parameters.
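As a usage sketch (assuming the log likelihood FRML is named LOGL), the option is attached to the estimation instruction in the usual way, MAXIMIZE being just one of the instructions that accepts it:

maximize(reject=(abs(g1+g2)>=1.0)) logl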


In the early iterations of a non-linear estimation, it’s quite possible for rather extreme parameter values to come up for evaluation. In most cases, you’ll just get a really low likelihood (or a natural NA if, for instance, the values would require taking the log of a negative number), and they’ll be rejected on that basis. REJECT can (and should) be used when the consequence would be something worse than just a really low likelihood.

 

You cannot use REJECT as a cheap way to impose a constraint on parameters when the boundary is feasible (that is, when the function is computable at the boundary). If, for instance, your function is computable at \(x \ge 1\) and you want to impose \(x \le 1\), the option REJECT=(X>1.0) will cause the optimization to fail to converge if the function is increasing at \(x = 1\). Given the shape of the function, the optimization will naturally be testing values of \(x\) larger than 1. When it’s told that the value there is NA (as a result of the REJECT), it will cut the step size, but will probably have to cut it quite a few times before getting below 1.0. The next iteration will likely need even more subiterations to get under 1, and so on. Eventually, it will hit the subiterations limit.
 

Instead, use the constrained optimization technique described above, by including X and the constraint X<=1.0 in the parameter set. This applies a penalty function to values above 1.0, but the penalty starts out relatively small, so, at least initially (given the shape of the function), the total (likelihood less penalty) will be higher for values above 1. The penalty grows as the iterations proceed, eventually dominating the function value and forcing X towards 1. You’ll probably end up with a final estimate of X that is something like 1.000001.
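A sketch of that setup (with LOGL again assumed to be a FRML already defined in terms of X):

compute x = 0.5
nonlin x x<=1.0
maximize logl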

 


Copyright © 2025 Thomas A. Doan