Logistic Regression (HD)

The Logistic Regression operator fits an S-shaped logistic (or logit) function to a data set to calculate the probability of the occurrence of a specific categorical event, based on the values of a set of independent variables.

Information at a Glance

Category: Model
Data source type: HD
Sends output to other operators: Yes
Data processing tool: n/a
Note: The Logistic Regression (HD) operator is for Hadoop data only. For database data, use the Logistic Regression (DB) operator.

For more detailed information on logistic regression, use cases, and this operator, see Probability Calculation Using Logistic Regression.

Algorithm

The Hadoop implementation of logistic regression uses a multinomial logistic regression (MLOR) algorithm, along with a Regularization Penalizing Parameter to avoid overfitting a model with too many variables. Multinomial logistic regression refers to the case in which the dependent variable can take on three or more possible outcomes (for example, "better" vs. "no change" vs. "worse"). Team Studio can run a multinomial regression when the dependent variable has more than two categorical values.

For multinomial logistic regression, the Logistic Regression operator computes a probability for a categorical dependent variable, indicating the probability of each class value occurring (such as the chance of having blue eyes). The algorithm statistically fits a multinomial logit function to a data set to calculate the probability of occurrence for a multi-category dependent variable with two or more discrete outcomes. For multinomial logistic regression, the dependent variable must be nominal (or categorical, meaning that it falls into one of a set of categories that cannot be ordered in any meaningful way).
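
In the textbook formulation of MLOR, each class k gets its own coefficient vector βk, and the probability of class k given feature vector x is

    P(y = k | x) = exp(βk · x) / Σj exp(βj · x)

(This is the standard multinomial logit form; the exact internal parameterization Team Studio uses is not documented here.)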

The Team Studio multinomial logistic regression implementation allows a Regularization Penalty Parameter to be applied to reduce the chance of overfitting the model.

Note: Multinomial regressions are not currently supported in the database implementation, only in the Hadoop implementation.

Input

A data set that contains the dependent and independent variables for modeling.

Bad or Missing Values
Predictions are made only for rows that contain data. Rows with missing data are skipped.

Configuration

As of Team Studio 6.3, the Logistic Regression operator searches the parameter space of λ and α and automatically selects the highest-performing model. To use this feature, provide either a comma-separated list (for example, .1,.2,.3) or a start:end:step range (for example, 0:1:.1) for λ or α. The operator computes all possible λ and α combinations, and the output from the operator is the model with the highest classification performance. The results of every parameter combination are visible in the results console under the Parameter optimization results tab.
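
Conceptually, the search behaves like the following Python sketch (illustrative only; train_model and classification_score are hypothetical stand-ins for Team Studio's internal training and scoring steps):

    from itertools import product

    lambdas = [0.1, 0.2, 0.3]                         # parsed from ".1,.2,.3"
    alphas = [round(0.1 * i, 1) for i in range(11)]   # parsed from "0:1:.1"

    best_model, best_score = None, float("-inf")
    for lam, alpha in product(lambdas, alphas):
        model = train_model(data, penalty=lam, elastic=alpha)   # hypothetical helper
        score = classification_score(model, data)               # hypothetical helper
        if score > best_score:
            best_model, best_score = model, score
    # best_model is the operator's output; every (λ, α) result appears
    # under the Parameter optimization results tab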

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Dependent Column The quantity to model or predict. Select the data column to be considered the dependent variable for the regression. The Dependent Column is often a categorical type, such as eye color = blue, green, or brown.
Note: For binomial logistic regression, the Dependent Column must be able to be categorized for a "Yes", "No" prediction (that is, it cannot have more than two distinct values) while for multinomial logistic regression, the Dependent Column can have multiple categorical values to predict.
Maximum Number of Iterations Specifies the maximum number of regression iterations to process before the algorithm stops, if the coefficients have not yet converged. The value must be an integer >= 1.

Default value: 10.

Tolerance Specifies the maximum allowed error value for the IRLS calculation method. When the error falls below this value, training of the logistic regression model stops. The value must be a decimal >= 0.

Default value: 0.0001.
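
To make these two stopping criteria concrete, the following is a minimal sketch of IRLS-style training for a binomial model (an illustrative reimplementation, not Team Studio's internal code):

    import numpy as np

    def irls_logistic(X, y, max_iter=10, tol=1e-4):
        beta = np.zeros(X.shape[1])
        for iteration in range(1, max_iter + 1):
            p = 1.0 / (1.0 + np.exp(-X @ beta))     # predicted probabilities
            W = p * (1 - p)                          # IRLS weights
            # Newton-Raphson step: solve (X'WX) step = X'(y - p)
            step = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
            beta += step
            if np.max(np.abs(step)) < tol:           # Tolerance reached: stop early
                break                                # else stop at Maximum Number of Iterations
        return beta, iteration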

Columns Specifies the independent variable data columns to include for the regression analysis or model training. At least one Column or one Interaction variable must be specified.

Click the Columns button to open the dialog box for selecting the available columns from the input data set for analysis. For more information, see Select Columns Dialog Box.

Interaction Parameters Enables the selection of available independent variables as data parameters thought to have a combined effect on the dependent variable.

Creating Interaction Parameters is useful when you believe the combined interaction of two independent variables is not additive.

To define an Interaction Parameter, click the Interaction Parameters button and select the suspected interacting data columns.

If you have feature A and feature B, selecting * uses A, B, and the interaction A*B as independent features, while selecting : means that only the interaction A*B is used in the model, as illustrated below.
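
This mirrors R-style model formula notation; for reference, the Python patsy library interprets the same symbols analogously (an external analogy, not Team Studio's implementation):

    import pandas as pd
    from patsy import dmatrix

    df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

    # "A*B" expands to A, B, and the interaction term A:B
    print(dmatrix("A * B", df, return_type="dataframe").columns.tolist())
    # ['Intercept', 'A', 'B', 'A:B']

    # "A:B" produces only the interaction term
    print(dmatrix("A : B", df, return_type="dataframe").columns.tolist())
    # ['Intercept', 'A:B']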

Use approximation (faster) Implements a single pass of the Hadoop Management System, allowing for a faster, although less accurate, modeling process.
Note: The approximation process essentially partitions the data into smaller chunks, trains the model in memory on each chunk, and averages the outcomes, as sketched after the default value below.

Default value: no.
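
A rough sketch of that partition-and-average idea (illustrative only; fit_logistic is a hypothetical stand-in for any in-memory solver, such as the IRLS sketch above):

    import numpy as np

    def approximate_fit(X, y, n_partitions=4):
        # Train a sub-model per data partition, then average the coefficients.
        parts = np.array_split(np.arange(len(y)), n_partitions)
        betas = [fit_logistic(X[idx], y[idx]) for idx in parts]   # hypothetical solver
        return np.mean(betas, axis=0)   # faster, but less accurate than a full fit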

Use Regularization Indicates whether to apply regularization to the logistic regression, as controlled by the Penalizing Parameter below.

Default value: no.

Penalizing Parameter (λ) Represents an optimization parameter for logistic regression. It controls the trade-off between model bias (the fit of the loss function) and the regularization portion of the minimization function (the variance of the regression coefficients). The value can be any number greater than or equal to 0.

The higher the Lambda, the lower the chance of overfitting with too many redundant variables. (Overfitting is the situation in which a model does a good job "learning" or converging to a low error for the training data but does not do as well for new, non-training data.)

For Linear Regression equations, it is possible to use a cross-validation process to pick the best Lambda value. However, for non-linear logit regression equations, Lambda should be defined based on user experience with the data. For example, if a model does well on the training data but does not perform well on the predictions, you might want to increase the Lambda value.

Default value: 0, which indicates no penalty.

Use Regularization for Feature Selection Only To use regularization only for feature selection and variable importance, enable this option in addition to Use Regularization. The operator then trains in two passes, as sketched after the default value below:
  1. A model is trained on the data set with regularization, from which the features to drop are determined.
  2. An ordinary (unregularized) regression is then trained with the remaining features, producing a model with p-values.

The output summary lists any features that were removed, and p-values are included for the second (retrained) model.

Default value: No.
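
A minimal sketch of this two-pass workflow, using scikit-learn and statsmodels as stand-ins (an analogy for illustration, not Team Studio's internal implementation):

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LogisticRegression

    def feature_select_then_refit(X, y):
        # Pass 1: L1-regularized fit; zeroed coefficients mark the features to drop.
        lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
        keep = np.flatnonzero(lasso.coef_.ravel() != 0)

        # Pass 2: ordinary (unregularized) logit on the kept features yields p-values.
        refit = sm.Logit(y, sm.add_constant(X[:, keep])).fit(disp=False)
        return keep, refit.pvalues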

Elastic Parameter (α) A constant value between 0 and 1 that controls the degree of the mix between L1 (Lasso) and L2 (Ridge) regularization. It is the α parameter in the Elastic Net Regularization loss function, whose penalty term takes the standard form λ(α‖β‖₁ + ((1 − α)/2)‖β‖₂²), where β is the vector of regression coefficients.

The Elastic Parameter combines the effects of both the Ridge and Lasso penalty constraints. Both types of penalties shrink the values of the correlation coefficients.

  • When Elastic Parameter (α) = 1, it becomes pure L1 Regularization (Lasso).
    • A Lasso constraint tends to remove redundant variables and therefore results in a sparse coefficient model, which is useful when there is high dimensionality in the data.
  • When the Elastic Parameter (α) = 0, it becomes pure L2 Regularization (Ridge).
    • A Ridge constraint tends to make the variables have similar correlation coefficients and is useful when the data has few variables, or low dimensionality.
  • When the Elastic Parameter (α) falls between 0 and 1, it implements a mix of both L1 (Lasso) and L2 (Ridge) constraints on the coefficients.

Default value: 1 (L1 or Lasso Regularization). A value of 0.5 would implement a compromise between the Lasso and Ridge constraints.

The Elastic Parameter (α) applies only when Use Spark is set to Yes.
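
For intuition, Spark MLlib's own logistic regression exposes the same two knobs; the following shows how λ and α map onto it (an analogy only; Team Studio configures its Spark job internally):

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(
        family="multinomial",
        maxIter=10,           # Maximum Number of Iterations
        tol=1e-4,             # Tolerance
        regParam=0.1,         # Penalizing Parameter (λ)
        elasticNetParam=1.0,  # Elastic Parameter (α): 1 = pure L1 (Lasso)
    )
    model = lr.fit(training_df)   # training_df: a hypothetical "features"/"label" DataFrame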

Use Spark If Yes (the default), uses Spark to optimize calculation time.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Note: In addition to the required logistic regression Prediction operator, it is helpful for the modeler to add Model Validation operators to get further model accuracy statistics (from the Goodness of Fit operator) and/or visual outputs (from the ROC and LIFT operators). See the Model Validation operators section for details.
Visual Results
The Summary tab displays the Iteration count, Deviance, Null Deviance, Chi-squared value, and Fraction of Variance Explained statistical values.



  • Number of iterations: Indicates the number of times the logistic regression re-weighting process was run. When the iteration count equals Maximum Number of Iterations, it flags that the regression might not have converged, or that there was a fit failure (that is, no correlation pattern was uncovered).
  • Deviance: Used as a statistic for overall fit of a logistic regression model. However, this number is meaningless by itself - it should be compared against the Null deviance value below or with its own value from previous runs of the regression.
    • Deviance compares the observed values, Y, to the expected (predicted) values, Ŷ.
    • The bigger the difference or Deviance of the observed values from the expected values, the poorer the fit of the model.
    • As more independent variables are added to the model, the deviance should get smaller, indicating improvement of fit.
  • Null deviance: Indicates the deviance of a "dumb" model - a random guess of yes/no without any predictor variables.
    • It is used as a comparison of performance against the model Deviance above.
    • The smaller the model Deviance (using predictors) can be made relative to the Null deviance (no predictors), the better the logistic regression model.
  • Chi-squared value: The difference between the Null deviance and the Deviance. Each deviance is technically a "negative two log likelihood" (-2LL) statistic, so their difference measures the logistic regression's effectiveness: Chi square = Null deviance minus Deviance.
    • The hope is for the Deviance to be less than the Null deviance. Another flag for the logistic regression model not converging or there being a fit failure is having the Deviance > Null deviance, or a negative chi square. This might indicate that there is a subset of the data that is over fit on - the modeler could try removing variables and rerunning the regression.
    • Note: Since the chi square is a measure of the difference between predicted and actual, it is similar to looking at the residuals for a Linear Regression.
  • Fraction of variance explained: The Chi-squared value divided by the Null deviance. This ratio provides a very useful diagnostic statistic representing the percentage of the system's variation the model explains (compared to a dumb model). When analyzing logistic regression results, looking at the Chi-squared/Null deviance value provides a similar statistic to the R2 value for Linear Regressions. As a rule of thumb, any Chi-squared/Null deviance value over .8 (80%) is considered a successful logistic regression model fit.
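
As a worked illustration of these statistics, the following sketch computes Deviance, Null deviance, the Chi-squared value, and the Fraction of variance explained from hypothetical predictions, using the standard binomial deviance formula (-2 times the log likelihood):

    import numpy as np

    y = np.array([1, 0, 1, 1, 0, 1, 0, 1])           # observed outcomes (hypothetical)
    p = np.array([.9, .2, .8, .7, .3, .9, .1, .6])   # predicted probabilities (hypothetical)

    def deviance(y, p):
        return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # -2LL

    dev = deviance(y, p)                                # model Deviance
    null_dev = deviance(y, np.full(len(y), y.mean()))   # "dumb" model: constant base rate
    chi_sq = null_dev - dev                             # Chi-squared value
    fraction = chi_sq / null_dev                        # Fraction of variance explained
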
Data Results
The Data tab displays the statistical fit numbers for each independent variable in the model.





  • Attribute: Displays the name of the independent variable.
  • Dependent Value: Displays for multinomial logistic regression only. Dependent Value shows the specific categorical value for the given regression statistical data. Note that the results have a row for each Attribute/Dependent Value pair.
  • Beta/Coefficient: Also represented as β, Beta is the Correlation Coefficient of the independent variable in the underlying linear model for the natural logarithm of the odds of the event occurring. Note: The Beta is therefore also referred to as being on the "log scale".
  • Odds Ratio: The Odds Ratio is the primary measure of the strength of a variable's impact on the results in logistic regression (that is, the "odds" of the event happening given the value of the independent variable). It represents a probability ratio of P/(1-P), where P is the probability of an event happening and 1-P is the probability of it not happening. It is calculated by taking the β coefficient and computing exp(β), or e^β, which provides a useful measure of the strength of the independent variable's impact on the outcome. For example, β = 0.75 gives an Odds Ratio of e^0.75 ≈ 2.12, indicating that a success becomes roughly twice as likely each time the independent variable's value increases by 1 unit.
    • The Odds Ratio is always greater than zero.
    • An Odds Ratio value of exactly 1 indicates the variable is not predictive or that the odds of a case outcome are equally likely for both groups under comparison.
    • Note: The further the Odds Ratio is above 1, the stronger the relationship between the dependent and independent variable in the logistic regression model.
  • SE/Standard Error: Represents the standard deviation of the estimated Coefficient values from the actual Coefficient values for the set of variables. As a rule of thumb, the actual value should lie within plus or minus two standard errors of the estimated coefficient value; therefore, the SE value should be much smaller than the estimated Beta/Coefficient value.
  • Z-value: Very similar to the T-value displayed for Linear Regressions; as the data set size increases, the T and Z distribution curves become identical. The Z-value relates the variable's distribution to the standard normal distribution. It compares the Beta/Coefficient size to the SE of the Coefficient and is calculated as Z = β/SE, where β is the estimated beta coefficient in the regression and SE is the standard error value for the Coefficient. The SE value and Z-value are intermediary calculations used to derive the more interesting P-value that follows, so they are not necessarily interesting in and of themselves. (A worked example appears after the note following this list.)
  • P-value: Calculated based on the Z-value distribution curve. It represents the level of confidence in the associated independent variable being relevant to the model, and it is the primary value used for quick assessment of a variable's significance in a logistic regression model. Specifically, it is the probability of still observing the dependent variable's value if the Coefficient value for the independent variable is zero (that is, if P-value is high, the associated variable is not considered relevant as a correlated, independent variable in the model).
    • A low P-value is evidence that the estimated Coefficient is not due to measurement error or coincidence, and therefore, is more likely a significant result. Thus, a low P-value gives the modeler confidence in the significance of the variable in the model.
    • Standard practice is to not trust Coefficients with P-values greater than 0.05 (5%). Note: a P-value of less than 0.05 is often conceptualized as there being over 95% certainty that the Coefficient is relevant. In actuality, this P-value is derived from the distribution curve of the Z-statistic - it is the area under the curve outside of + or - 2 standard errors from the estimated Coefficient value.
    • Note: The smaller the P-value, the more meaningful the coefficient or the more certainty over the significance of the independent variable in the logistic regression model.
  • Wald Statistic: Used to assess the significance of the Correlation Coefficients. It is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient: Wald = β²/SE². The Wald Statistic tends to be biased when the data is sparse. It is analogous to the t-test in Linear Regression.
Note: When assessing the Data tab of the Logistic Regression operator, a modeler mostly cares about the Odds Ratios, which indicate strength of the correlation between the dependent and independent variables, and the P-values, which indicate how much not to trust the estimated coefficient measurements.
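
The following sketch ties these columns together, deriving the Odds Ratio, Z-value, P-value, and Wald Statistic from a hypothetical β and SE using the formulas above:

    import numpy as np
    from scipy import stats

    beta, se = 0.75, 0.30                        # hypothetical coefficient and standard error

    odds_ratio = np.exp(beta)                    # exp(β) ≈ 2.12
    z_value = beta / se                          # Z = β/SE = 2.5
    p_value = 2 * stats.norm.sf(abs(z_value))    # two-sided P-value ≈ 0.012
    wald = z_value ** 2                          # β²/SE² = 6.25
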
Coefficient Results (multinomial logistic regression)
For multinomial logistic regression results, the Correlation Coefficient value for each of the specific dependent variable's categorical values is displayed.



P-Value Results (multinomial logistic regression)
For multinomial logistic regression results, the P-Value for each of the specific dependent variable's categorical values is displayed.



Standard Error Results (multinomial logistic regression)
For multinomial logistic regression results, the Standard Error value for each of the specific dependent variable's categorical values is displayed.



Wald Statistic Results (multinomial logistic regression)
For multinomial logistic regression results, the Wald Statistic for each of the specific dependent variable's categorical values is displayed.



Z-Value Results (multinomial logistic regression)
For multinomial logistic regression results, the Z-Value for each of the specific dependent variable's categorical values is displayed.



Heat Map Results (multinomial logistic regression)

For multinomial logistic regression results, the Heat Map displays information about actual vs. predicted counts of a classification model and helps assess the model's accuracy for each of the possible class values.

In this example, the Heat Map shows an overall model accuracy of 95.33%, with the highest prediction accuracy being for the class value "Iris-setosa" (100% accurate predictions) versus the lowest being for the "Iris-virginica" (88% accurate predictions).
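
Per-class accuracy in the Heat Map is each class's diagonal (correctly predicted) count divided by its row total, and overall accuracy is the diagonal sum over all predictions. A minimal computation with hypothetical counts consistent with this example:

    import numpy as np

    # Rows = actual class, columns = predicted class (hypothetical counts).
    confusion = np.array([[50,  0,  0],    # Iris-setosa
                          [ 0, 49,  1],    # Iris-versicolor
                          [ 0,  6, 44]])   # Iris-virginica

    per_class = np.diag(confusion) / confusion.sum(axis=1)   # [1.00, 0.98, 0.88]
    overall = np.trace(confusion) / confusion.sum()          # 143/150 ≈ 0.9533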



Important: To learn more about the visualization available in this operator, see Exploring Visual Results.
Data Output
A file with a structure similar to the visual output is available.