Nonlinear Estimation Procedures - Function Minimization Algorithms

Now that we have discussed different regression models, and the loss functions that can be used to estimate them, the only "mystery" that is left is how to minimize the loss functions (to find the best fitting set of parameters), and how to estimate the standard errors of the parameter estimates. Nonlinear Estimation uses one very efficient general algorithm (quasi-Newton) that approximates the second-order derivatives of the loss function to guide the search for the minimum (i.e., for the best parameter estimates, given the respective loss function).

For nonlinear least-squares regression (i.e., nonlinear regression functions, and the least-squares loss function), Nonlinear Estimation includes a dedicated algorithm that is very efficient and robust, and is the recommended estimation method when analyzing large data sets and using the least-squares loss function): The Levenberg-Marquardt algorithm (Levenberg, 1944; Marquardt, 1963; see also Moré, 1977).

In addition, Nonlinear Estimation offers several more general function minimization algorithms that follow different search strategies (which do not depend on the evaluation of derivatives). These strategies are sometimes more effective for estimating loss functions with local minima; therefore, these methods are often particularly useful to find appropriate start values for the estimation via the quasi-Newton method.

In all cases, STATISTICA can compute (by request) the standard errors of the parameter estimates. These standard errors are based on the second-order partial derivatives for the parameters, which are computed via finite difference approximation.

If you are not interested in how the minimization of the loss function is done, only that it can be done, you can skip the following paragraphs. However, you may find it useful to know a little about these procedures in case your regression model "refuses" to be fit to the data. In that case, the iterative estimation procedure will fail to converge, producing ever "stranger" (e.g., very large or very small) parameter estimates.

In the following paragraphs we will first discuss some general issues involved in unconstrained optimization, and then briefly review the methods used in this module. For more detailed discussions of these procedures you can refer to Brent (1973), Gill and Murray (1974), Peressini, Sullivan, and Uhl (1988), and Wilde and Beightler (1967). For specific algorithms, see Dennis and Schnabel (1983), Eason and Fenton (1974), Fletcher (1969), Fletcher and Powell (1963), Fletcher and Reeves (1964), Hooke and Jeeves (1961), Jacoby, Kowalik, and Pizzo (1972), and Nelder and Mead (1964).

Start Values, Step Sizes, Convergence Criteria
A common aspect of all estimation procedures is that they require the user to specify some start values, initial step sizes, and a criterion for convergence. All methods will begin with a particular set of initial estimates (start values), which will be changed in some systematic manner from iteration to iteration; in the first iteration, the step size determines by how much the parameters will be moved. Finally, the convergence criterion determines when the iteration process will stop. For example, the process may stop when the improvements in the loss function from iteration to iteration are less than a specific amount. Nonlinear Estimation has preset defaults for all of these parameters, which are appropriate in most cases.
Penalty Functions, Constraining Parameters
All estimation procedures in Nonlinear Estimation are unconstrained in nature. That means that the program will move parameters around without any regard for whether or not permissible values result. For example, in the course of logit regression we may get estimated values that are equal to 0.0, in which case the logarithm cannot be computed (since the log of 0 is undefined). When this happens, the program will assign a penalty to the loss function, that is, a very large value. As a result, the various estimation procedures usually move away from the regions that produce those functions. However, in some circumstances, the estimation will "get stuck," and as a result, you would see a very large value of the loss function. This could happen, if, for example, the regression equation involves taking the logarithm of an independent variable which has a value of zero for some cases (in which case the logarithm cannot be computed).

If you want to constrain a procedure in Nonlinear Estimation, then this constraint must be specified in the loss function as a penalty function (assessment). By doing this, you can control what permissible values of the parameters to be estimated can be manipulated by the program. For example, if two parameters (a and b) are to be constrained to be greater than or equal to zero, then one must assess a large penalty to these parameters if this condition is not met. Below is an example of a user-specified regression and loss function, including a penalty assessment designed to "penalize" the parameters a and/or b if either one is not greater than or equal to zero:

Estimated function: v3 = a + b*v1 + (c*v2)
 Loss function: L = (obs - pred)2 + (a<0)*100000 + (b<0)*100000

Local Minima
The most "treacherous" threat to unconstrained function minimization are local minima. For example, a particular loss function may become slightly larger, regardless of how a particular parameter is moved. However, if the parameter were to be moved into a completely different place, the loss function may actually become smaller. You can think of such local minima as local "valleys" or minor "dents" in the loss function. However, in most practical applications, local minima will produce "outrageous" and extremely large or small parameter estimates with very large standard errors. In those cases, specify different start values and try again. Also note, that the Simplex method (see below) is particularly "smart" in avoiding such minima; therefore, this method may be particularly suited in order to find appropriate start values for complex functions.
Quasi-Newton Method
As you may remember, the slope of a function at a particular point can be computed as the first-order derivative of the function (at that point). The "slope of the slope" is the second-order derivative, which tells us how fast the slope is changing at the respective point, and in which direction. The quasi-Newton method will, at each step, evaluate the function at different points in order to estimate the first-order derivatives and second-order derivatives. It will then use this information to follow a path towards the minimum of the loss function.
Levenberg-Marquardt Algorithm (Nonlinear Least Squares)
Levenberg (1944) and Marquardt (1963) proposed an efficient method for estimating the parameters of nonlinear regression models, when using the least-squares loss function. Using the least squares loss function, the second-order partial derivatives do not have to be computed (or approximated) in order to find the least-squares parameter estimates; instead, the algorithm will in each iteration solve a set of linear equations to compute the gradient, which computationally is relatively (compared to other optimization techniques) easy and fast. Details of the derivation and specific steps defining the Levenberg-Marquardt algorithm are discussed in Moré (1977).  In particular when analyzing large data sets, this estimation procedure (and the least-squares loss function) is the recommended method for computing parameter estimates.  

The Levenberg-Marquardt (LM) method is an improvement/modification of the Gauss-Newton method for obtaining the minimizing solution of the nonlinear least squares problem. Consider the nonlinear model fitting y = ¦ (q,c) with the given data Xi and Yi, i = 1,...,m where Xi is of dimension k and θ is of dimension N. The LM method seeks θ* the solution of θ (locally) minimizing:

g(q)= åim=1(Yi-¦ (q,Xi ))2

The LM finds the solution applying the routine:

q j+1=q j-(J'J+lD)-1J'(Y-¦(q,Xi))

iteratively, where:

Y is the m x 1 vector containing Y1,..., Ym
X is the m x k  matrix containing X1,..., Xm
J is the  m x n Jacobian matrix for ¦(q,c) with respect to  q
D is a  n x n  diagonal matrix to adjust scale factors

The automatic scaling method explained in More (1977) is used in STATISTICA. You are asked to input your (nonlinear) model function ¦(q,c), initial parameter values (default 0.1), and the data for Y and X.

Input:

Data dependent variable Y, and independent variables X
Model model function ¦(q,c)
Initial values the value of θ in the first iteration

Output:

Minimized value of g(θ*)
Coefficient of determination R-square
Estimator θ, i.e. θ*
Covariance matrix of θ*, g(θ*)(J'J)-1/(m-n)
Correlation matrix of θ*
Confidence Interval of θ* at variable significant level
Residuals and Predicted values
ANOVA table
History of iteration for the least square values and the parameters
Simplex Procedure
This algorithm does not rely on the computation or estimation of the derivatives of the loss function. Instead, at each iteration the function will be evaluated at m+1 points in the m dimensional parameter space. For example, in two dimensions (i.e., when there are two parameters to be estimated), the program will evaluate the function at three points around the current optimum. These three points would define a triangle; in more than two dimensions, the "figure" produced by these points is called a Simplex. Intuitively, in two dimensions, three points will allow us to determine "which way to go," that is, in which direction in the two dimensional space to proceed in order to minimize the function. The same principle can be applied to the multidimensional parameter space, that is, the Simplex will "move" downhill; when the current step sizes become too "crude" to detect a clear downhill direction, (i.e., the Simplex is too large), the Simplex will "contract" and try again.

An additional strength of this method is that when a minimum appears to have been found, the Simplex will again be expanded to a larger size to see whether the respective minimum is a local minimum. Thus, in a way, the Simplex moves like a smooth single cell organism down the loss function, contracting and expanding as local minima or significant ridges are encountered.

Hooke-Jeeves Pattern Moves
In a sense this is the simplest of all algorithms. At each iteration, this method first defines a pattern of points by moving each parameter one by one, so as to optimize the current loss function. The entire pattern of points is then shifted or moved to a new location; this new location is determined by extrapolating the line from the old base point in the m dimensional parameter space to the new base point. The step sizes in this process are constantly adjusted to "zero in" on the respective optimum. This method is usually quite effective, and should be tried if both the quasi-Newton and Simplex methods (see above) fail to produce reasonable estimates.
Rosenbrock Pattern Search
Where all other methods fail, the Rosenbrock pattern search method often succeeds. This method rotates the parameter space and aligns one axis with a ridge (this method is also called the method of rotating coordinates); all other axes remain orthogonal to this axis. If the loss function is unimodal and has detectable ridges pointing towards the minimum of the function, then this method will proceed with sure-footed accuracy towards the minimum of the function. However, note that this search algorithm may terminate early when there are several constraint boundaries (resulting in the penalty value; see above) that intersect, leading to a discontinuity in the ridges.
Hessian Matrix and Standard Errors
The matrix of second-order (partial) derivatives is also called the Hessian matrix. It turns out that the inverse of the Hessian matrix approximates the variance/covariance matrix of parameter estimates. Intuitively, there should be an inverse relationship between the second-order derivative for a parameter and its standard error: If the change of the slope around the minimum of the function is very sharp, then the second-order derivative will be large; however, the parameter estimate will be quite stable in the sense that the minimum with respect to the parameter is clearly identifiable. If the second-order derivative is nearly zero, then the change in the slope around the minimum is zero, meaning that we can practically move the parameter in any direction without greatly affecting the loss function. Thus, the standard error of the parameter will be very large.

The Hessian matrix (and asymptotic standard errors for the parameters) are computed separately via finite difference approximation by selecting the Asymptotic standard errors check box from the Advanced tab of the Model Estimation dialog. This procedure yields very precise asymptotic standard errors for all estimation methods.