Logistic Regression - MADlib

The binomial Logistic Regression (MADlib) operator models the relationship between a dichotomous dependent variable and one or more predictor variables.

Information at a Glance

Category: Model
Data source type: DB
Sends output to other operators: Yes
Data processing tool: MADlib

Algorithm

  • The dependent variable is a Boolean value that can be represented with a Boolean expression.
  • (Binomial) logistic regression refers to a stochastic model in which the conditional mean of the dependent dichotomous variable is the logistic function of an affine function of the vector of the independent variables.
  • Logistic regression finds the vector of coefficients that maximizes the likelihood of the observations.
  • Currently, logistic regression in MADlib can use one of the following three algorithms:
    • Iteratively Reweighted Least Squares (IRLS).
    • A conjugate-gradient approach, also known as the Fletcher-Reeves method in the literature, where the Hestenes-Stiefel rule is used to calculate the step size.
    • Incremental gradient descent, also known as incremental gradient methods or stochastic gradient descent in the literature.

See the official MADlib documentation for more information.
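
To make the model described above concrete, the following is a minimal sketch of binomial logistic regression fitted with Newton-style IRLS updates in plain Python. It is an illustration only and is not MADlib's implementation. The function names (fit_irls, log_likelihood, and so on), the fixed iteration count, and the six-row toy data set are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_likelihood(X, y, coef):
    """Log-likelihood of Boolean outcomes y under the logistic model."""
    total = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(dot(coef, row))
        total += math.log(p) if yi else math.log(1.0 - p)
    return total


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def fit_irls(X, y, iterations=25):
    """Run a fixed number of IRLS (Newton) updates starting from zero."""
    k = len(X[0])
    coef = [0.0] * k
    for _ in range(iterations):
        H = [[0.0] * k for _ in range(k)]  # X^T W X with W = diag(p(1-p))
        g = [0.0] * k                      # X^T (y - p), the score vector
        for row, yi in zip(X, y):
            p = sigmoid(dot(coef, row))
            w = p * (1.0 - p)
            for i in range(k):
                g[i] += row[i] * (yi - p)
                for j in range(k):
                    H[i][j] += w * row[i] * row[j]
        step = solve(H, g)
        coef = [c + s for c, s in zip(coef, step)]
    return coef


# Toy data (assumed): an intercept column of 1s plus one predictor.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
y = [0, 0, 1, 0, 1, 1]

coef = fit_irls(X, y)
print("coefficients:", coef)
print("log-likelihood:", log_likelihood(X, y, coef))
```

The affine function is the dot product of the coefficient vector with each row (the leading 1.0 supplies the intercept), and the logistic function maps it to the conditional mean of the Boolean outcome. The toy outcomes overlap on purpose: with perfectly separable data the likelihood has no finite maximizer and the updates would not settle.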

Input

A data set that contains the dependent and independent variables for modeling.

Configuration

Parameter Description
Notes: Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
MADlib Schema Name: The schema where MADlib is installed in the database. MADlib must be installed in the same database as the input data set. If a "madlib" schema exists in the database, this parameter defaults to madlib.
Model Output Schema Name: The name of the schema where the output is stored.
Model Output Table: The name of the table that is created to store the regression model. Specifically, the model output table stores the following columns:
[ group_col_1 | group_col_2 | ... |] coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations
See the official MADlib logistic regression documentation for more information.
Drop If Exists:
  • If Yes (the default), drop the existing table of the same name and create a new one.
  • If No, stop the flow and alert the user that an error has occurred.
Dependent Variable: The quantity to model or predict, which must be a Boolean value. The list of the available data columns for the Regression operator is displayed. Select the data column to use as the dependent variable for the regression.
Independent Variables: Specifies the independent variable data columns to include for the regression analysis or model training. You must specify at least one column. Click Select Columns to open the dialog box for selecting the available columns from the input data set for analysis.
Grouping Columns: Specifies one or more columns used to group the input data. A separate regression model is built for each group. Click Select Columns to open the dialog box for selecting the available columns from the input data set for grouping.
Maximum Iterations: The maximum number of iterations allowed. The computation stops when the number of iterations exceeds Maximum Iterations or when the difference between log-likelihood values in successive iterations is less than the Convergence Tolerance.
Optimizer: The algorithm used to compute the model. It can be one of the following:
  • Iteratively Reweighted Least Squares
  • Conjugate-Gradient, also known as the Fletcher-Reeves method in the literature, where the Hestenes-Stiefel rule is used to calculate the step size.
  • Incremental Gradient Descent, also known as incremental gradient methods or stochastic gradient descent in the literature.
Convergence Tolerance: The difference between log-likelihood values in successive iterations below which the model is considered to have converged. A value of zero disables the convergence criterion, so execution stops only after the number of iterations set in Maximum Iterations is complete.
Verbosity: Set to true (the default) to log all SQL console output from training.
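
The way Maximum Iterations and Convergence Tolerance interact can be sketched in a few lines of Python. This is a hedged illustration of the stopping rule as described above, not the operator's or MADlib's actual code. The names run_until_converged and make_toy_step are hypothetical, the toy log-likelihood sequence is invented for the example, and the sketch assumes that at most Maximum Iterations iterations are run.

```python
def run_until_converged(step, max_iterations, tolerance):
    """Call step() until the log-likelihood change is below tolerance
    or max_iterations iterations have run (assumed interpretation).

    step() performs one optimizer iteration and returns the new
    log-likelihood. A tolerance of zero disables the convergence check.
    """
    previous = None
    iterations = 0
    while True:
        current = step()
        iterations += 1
        if (previous is not None and tolerance > 0
                and abs(current - previous) < tolerance):
            return iterations, "converged"
        if iterations >= max_iterations:
            return iterations, "max_iterations"
        previous = current


def make_toy_step():
    """Toy optimizer: the k-th call returns -2 - 0.5**k, which rises toward -2."""
    state = {"k": 0}

    def step():
        state["k"] += 1
        return -2.0 - 0.5 ** state["k"]

    return step


# Successive differences are 0.5**k, which first drops below 1e-3 at k = 10.
print(run_until_converged(make_toy_step(), max_iterations=20, tolerance=1e-3))
# A zero tolerance disables the check, so all 20 iterations run.
print(run_until_converged(make_toy_step(), max_iterations=20, tolerance=0.0))
```

With a small iteration limit, such as 5 in this toy setup, the loop ends on the limit before the tolerance is ever met, which is a sign that either setting may need adjusting.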

Output

Visual Output
Output is displayed in a single tab. For further output and assessment of the quality of the Logistic Regression model, add ROC and Lift operators, in addition to the required Logistic Regression Prediction operator.

The Logistic Regression (MADlib) operator output includes the coefficients (beta) of the model, the Odds Ratio, the standard error (SE), the Z-value, and the P-value statistics.
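
As a rough guide to how these statistics relate to one another, the sketch below derives the Z-value, a two-sided P-value, and the odds ratio from a coefficient and its standard error, using the usual Wald definitions. It is an assumption that the operator's output follows exactly these formulas. The function name wald_summary and the input numbers are hypothetical and do not come from a real model.

```python
import math


def wald_summary(coef, std_err):
    """Derive per-coefficient statistics from estimates and standard errors."""
    rows = []
    for b, se in zip(coef, std_err):
        z = b / se
        # Two-sided P-value under a standard normal: 2 * (1 - Phi(|z|)).
        p = math.erfc(abs(z) / math.sqrt(2.0))
        rows.append({
            "coef": b,
            "std_err": se,
            "z_stat": z,
            "p_value": p,
            "odds_ratio": math.exp(b),  # multiplicative change in the odds
        })
    return rows


# Illustrative inputs only (not from a fitted model).
for row in wald_summary(coef=[-1.2, 0.8], std_err=[0.6, 0.25]):
    print(row)
```

An odds ratio above 1 means that a one-unit increase in the predictor raises the odds of a true outcome, and a small P-value suggests that the coefficient differs from zero.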

Data Output
None. This is a terminal operator.