Example 1: Predicting Success/Failure
This example first reviews the logit and probit models for binary dependent variables; these models are "pre-wired" into the Nonlinear Estimation module and can be chosen as dialog options. It then presents numerous examples of how to fit models and use different loss functions specified by the user. Note that Logistic (Logit) and Probit regression models can also be fit using the Generalized Linear/Nonlinear Models (GLZ) facilities of Statistica; GLZ contains options for fitting ANOVA- and ANCOVA-like designs to binomial and multinomial response variables, and provides methods for stepwise and best-subset selection of predictors.
This example is based on a data set described in Neter, Wasserman, and Kutner (1985, page 357; note, however, that those authors fit a linear regression model to the data). Suppose you want to study whether experience helps programmers complete complex programming tasks within a specified amount of time. Twenty-five programmers with varying amounts of experience (measured in months) were selected and asked to complete a complex programming task within a set time limit. The binary dependent variable is each programmer's success or failure in completing the task. These data are recorded in the file Program.sta; shown below is a partial listing of this file.


Double-click Quick Logit regression on the Quick tab to display the Logistic Regression (Logit) dialog.
Next, click the Variables button to display the standard variable selection dialog and select the variable Success from the Dichotomous dependent variable list and Expernce from the Continuous independent variable list. Click the OK button.
Statistica will automatically enter the codes for the dependent variable in this dialog. You can also specify the type of missing data deletion (Casewise deletion of missing data or Mean substitution of missing data).

Accept the program defaults and click the OK button in the Logistic Regression (Logit) dialog to display the Model Estimation dialog. In this dialog, you can select the estimation method as well as specify the convergence criterion, start values, etc. You can also elect to compute the asymptotic standard errors for the parameter estimates separately (via finite difference approximation). For this example, on the Advanced tab, select the Asymptotic standard errors check box.

To review the descriptive statistics for all selected variables, on the Review tab, click the Means & standard deviations button. As in most other descriptive statistics spreadsheets in Statistica, the default graph is the histogram with the normal curve superimposed (right-click on the EXPERNCE column in the spreadsheet and select Graphs of Input Data - Histogram EXPERNCE - Normal Fit from the resulting shortcut menu). Thus, you could at this point evaluate the distributions of the variables.

The different estimation procedures in the Nonlinear Estimation module are discussed in the Introductory Overviews. Click the Estimation method drop-down box on the Advanced tab of the Model Estimation dialog to see the different options.

A good way to start the analysis is with the default settings in this dialog. As discussed in the Introductory Overviews, all estimation procedures require as input start values, initial step sizes, and the convergence criterion. Again, simply accept the defaults as shown in this dialog and click the OK button to estimate the parameters.
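As a sketch of what such an estimation procedure does, the logit model can be fit by directly maximizing its log-likelihood. The following is a minimal illustration in Python/scipy, not Statistica's implementation; the data values are hypothetical stand-ins for Program.sta, and the start values and convergence tolerance play the same role as the corresponding options in the Model Estimation dialog.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Hypothetical stand-in for the Program.sta data (illustrative values
# only): months of experience and task success (1) / failure (0)
experience = np.array([4, 6, 8, 10, 12, 14, 18, 20, 24, 26, 30, 32], float)
success = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1], float)

def neg_log_lik(beta):
    # Logit model: P(success) = 1 / (1 + exp(-(b0 + b1*experience)))
    eta = beta[0] + beta[1] * experience
    # Negative log-likelihood, written in a numerically stable form
    return -np.sum(success * eta - np.logaddexp(0.0, eta))

def gradient(beta):
    # Analytic gradient of the negative log-likelihood
    p = expit(beta[0] + beta[1] * experience)
    return np.array([np.sum(p - success),
                     np.sum((p - success) * experience)])

# Start values and convergence criterion, analogous to the options
# shown in the Model Estimation dialog
start = np.array([0.0, 0.0])
fit = minimize(neg_log_lik, start, jac=gradient, method="BFGS",
               options={"gtol": 1e-6})
b0, b1 = fit.x  # b1 > 0: more experience raises the success probability
```

BFGS is a quasi-Newton method, in the same family as the quasi-Newton option offered in the Estimation method drop-down box.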

Now, review the parameter estimates by clicking the Summary: Parameters & standard errors button on the Results dialog - Quick tab. As described in the Introductory Overviews, the standard errors are computed from the finite difference approximation of the Hessian matrix of second-order derivatives. By dividing the estimates by their respective standard errors, you can compute approximate t-values, and thus the statistical significance levels for each parameter. The results in the spreadsheet show that both parameters are significant at the p<.05 level.
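The arithmetic in that step can be shown with hypothetical numbers (the estimates and standard errors below are illustrative only, not the actual spreadsheet output):

```python
import numpy as np
from scipy import stats

# Hypothetical parameter estimates and asymptotic standard errors
# (illustrative numbers, not the actual spreadsheet values)
estimates = np.array([-3.06, 0.1615])   # intercept, Expernce slope
std_errors = np.array([1.26, 0.065])

# Approximate t-values: each estimate divided by its standard error
t_values = estimates / std_errors

# Two-sided significance levels, with n - k = 25 - 2 = 23 degrees of
# freedom for 25 cases and 2 parameters
p_values = 2 * stats.t.sf(np.abs(t_values), df=23)
```

With these illustrative numbers, both t-values exceed 2 in absolute value, so both p-values fall below .05.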



All cases with a predicted value (probability) less than or equal to .5 are classified as Failure; those with a predicted value greater than .5 are classified as Success. The Odds ratio is computed as the ratio of the product of the correctly classified cases over the product of the incorrectly classified cases. Odds ratios that are greater than 1 indicate that the classification is better than what one would expect by pure chance. However, remember that these are post-hoc classifications: the parameters were computed so as to maximize the probability of the observed data (see the description of the maximum likelihood loss function in the Introductory Overviews). Thus, you should not expect the model to classify new (future) observations this well.
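The classification rule and the odds ratio computation can be written out directly. The observed outcomes and predicted probabilities below are hypothetical, chosen only to show the arithmetic:

```python
import numpy as np

# Hypothetical observed outcomes and model-predicted probabilities
# (illustrative values only)
observed = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
predicted_p = np.array([0.2, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0.9, 0.55, 0.45])

# Cases with predicted probability <= .5 are classified as Failure (0),
# those above .5 as Success (1)
classified = (predicted_p > 0.5).astype(int)

# Cells of the 2x2 classification table
a = np.sum((observed == 1) & (classified == 1))  # correct Successes
d = np.sum((observed == 0) & (classified == 0))  # correct Failures
b = np.sum((observed == 1) & (classified == 0))  # missed Successes
c = np.sum((observed == 0) & (classified == 1))  # false Successes

# Odds ratio: product of correct classifications over product of
# incorrect ones; values greater than 1 beat pure chance
odds_ratio = (a * d) / (b * c)
```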

If the residuals (observed minus predicted values) are normally distributed, they will fall approximately onto a straight line in the normal probability plot. In the current example, essentially all points (residuals) in the normal probability plot are very close to the line, indicating that the residuals are normally distributed.
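The point coordinates behind such a normal probability plot can be computed directly. This is a minimal sketch using scipy.stats.probplot with hypothetical residuals (not the actual residuals from this example); an r value close to 1 means the points lie close to the straight line:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals (observed minus predicted values);
# illustrative numbers only
residuals = np.array([-0.12, 0.05, 0.30, -0.25, 0.10,
                      -0.08, 0.18, -0.30, 0.02, 0.15])

# probplot returns the (theoretical quantile, ordered residual) pairs
# that make up the normal probability plot, plus the least-squares
# line fitted through them and its correlation coefficient r
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```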

Again, it appears from this plot that the residuals are basically normally distributed.

Now fit the same model using probit regression. Click the OK button in the Probit Regression dialog to display the Model Estimation dialog.

Now, click the OK button to display the Results dialog.
Now, on the Results dialog - Residuals tab, click the Observed, predicted, residual vals button to look at the predicted values of the dependent variable Success.

As you can see in the results spreadsheet above, the predicted values (probabilities of success) for each case under the probit model are very similar to those for the logit model. In fact, in most cases, the difference between these two models is negligible.
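That near-equivalence can be checked numerically: once their scales are matched, the logistic and standard normal cumulative distribution functions give very similar probabilities. In this sketch the coefficients are hypothetical; the probit coefficients are set to roughly the logit coefficients divided by 1.7, the usual scale factor between the two link functions:

```python
import numpy as np
from scipy import stats

# Hypothetical coefficient values for the two models (illustrative
# only, not the actual estimates from this example)
logit_b0, logit_b1 = -3.06, 0.1615
probit_b0, probit_b1 = -1.82, 0.0965   # ~ logit values / 1.7

experience = np.linspace(4, 32, 8)

# Predicted success probabilities under each model
p_logit = 1.0 / (1.0 + np.exp(-(logit_b0 + logit_b1 * experience)))
p_probit = stats.norm.cdf(probit_b0 + probit_b1 * experience)

# Largest disagreement between the two sets of predictions
max_gap = np.max(np.abs(p_logit - p_probit))
```

Across the whole range of the predictor, the two curves differ by only a small amount, which is why the choice between logit and probit is usually negligible in practice.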