Logistic Regression

Information at a Glance

Category	Model
Data source type	DB
Sends output to other operators	Yes
Data processing tool	MADlib

Algorithm

The dependent variable is a Boolean value that can be represented with a Boolean expression.
(Binomial) logistic regression refers to a stochastic model in which the conditional mean of the dependent dichotomous variable is the logistic function of an affine function of the vector of the independent variables.
Logistic regression finds the vector of coefficients that maximizes the likelihood of the observations.
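
As a compact restatement of the two sentences above, the model and the quantity being maximized can be written as follows. The symbols are assumptions chosen for illustration and do not come from the operator itself: x is the vector of independent variables with a leading constant 1 (so that c^T x is an affine function), c is the coefficient vector, and y_i in {0, 1} encodes the Boolean dependent variable.

```latex
\begin{align}
\Pr(y = 1 \mid x) &= \sigma(c^\top x), \qquad \sigma(t) = \frac{1}{1 + e^{-t}} \\
\ell(c) &= \sum_{i=1}^{n} \Big[ y_i \log \sigma(c^\top x_i) + (1 - y_i) \log\big(1 - \sigma(c^\top x_i)\big) \Big]
\end{align}
```

The fitted coefficients are the vector c that maximizes the log-likelihood \ell(c).
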
Currently, logistic regression in MADlib can use one of the following three algorithms.
- Iteratively Reweighted Least Squares
- A conjugate-gradient approach, also known as Fletcher-Reeves method in the literature, where the Hestenes-Stiefel rule is used to calculate the step size.
- Incremental gradient descent, also known as incremental gradient methods or stochastic gradient descent in the literature.
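
To make the third option more concrete, the following is a minimal pure-Python sketch of incremental (stochastic) gradient descent for logistic regression. It is not MADlib's implementation: the function names, the fixed step size, the epoch count, and the toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic_igd(rows, labels, step=0.1, epochs=200, seed=0):
    """Fit coefficients by incremental gradient ascent on the log-likelihood.

    rows:   list of feature lists; a leading 1.0 supplies the intercept.
    labels: list of booleans (the dependent variable).
    step, epochs, seed: hypothetical tuning knobs, not MADlib parameters.
    """
    rng = random.Random(seed)
    coef = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the observations in random order
        for i in order:
            p = sigmoid(sum(c * x for c, x in zip(coef, rows[i])))
            err = (1.0 if labels[i] else 0.0) - p
            # Update using the gradient contribution of one observation.
            coef = [c + step * err * x for c, x in zip(coef, rows[i])]
    return coef


# Toy, non-separable data: [intercept term, single independent variable].
rows = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
labels = [False, False, True, False, True, True]
coef = fit_logistic_igd(rows, labels)
```

Each update touches a single row, which is what distinguishes this approach from Iteratively Reweighted Least Squares, where every iteration solves a weighted least-squares problem over the whole data set.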

See the official MADlib documentation for more information.

Input

A data set that contains the dependent and independent variables for modeling.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
MADlib Schema Name	Schema where MADlib is installed in the database. MADlib must be installed in the same database as the input data set. If a "madlib" schema exists in the database, this parameter defaults to madlib.
Model Output Schema Name	Name of the schema where the output is stored.
Model Output Table	Name of the table that is created to store the regression model. The model output table contains the following columns: `[ group_col_1 | group_col_2 | ... |] coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations`. See the official MADlib logistic regression documentation for more information.
Drop If Exists	If Yes (the default), drop the existing table of the same name and create a new one. If No and a table of the same name already exists, stop the flow and alert the user that an error has occurred.
Dependent Variable	The quantity to model or predict. It must be a Boolean value. Select the data column to be the dependent variable for the regression from the list of available data columns.
Independent Variables	Specifies the independent variable data columns to include for the regression analysis or model training. You must specify at least one column. Click Select Columns to open the dialog box for selecting the available columns from the input data set for analysis.
Grouping Columns	Specifies at least one column to group the input data and build separate regression models for each group. Click Select Columns to open the dialog box for selecting the available columns from the input data set for grouping.
Maximum Iterations	Maximum number of iterations allowed. The computation stops when the number of iterations exceeds Maximum Iterations or when the difference between log-likelihood values in successive iterations is less than the Convergence Tolerance.
Optimizer	Algorithm used to compute the model. One of the following: (1) Iteratively Reweighted Least Squares; (2) Conjugate-Gradient, also known as Fletcher-Reeves method in the literature, where the Hestenes-Stiefel rule is used to calculate the step size; (3) Incremental Gradient Descent, also known as incremental gradient methods or stochastic gradient descent in the literature.
Convergence Tolerance	The difference between log-likelihood values in successive iterations that indicates convergence. A value of zero disables the convergence criterion, so that execution stops only after the number of iterations set in Maximum Iterations is complete.
Verbosity	Set to true (the default) to log all SQL console output of the results of training.
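
The interaction between Maximum Iterations and Convergence Tolerance described in the table above can be sketched as a small stopping rule. This is an illustration only: the function name and arguments are hypothetical, and the exact comparison that MADlib performs (for example, strict versus non-strict inequality, or absolute versus relative difference) is an assumption here.

```python
def should_stop(iteration, prev_ll, curr_ll, max_iterations, tolerance):
    """Return True when training should stop.

    iteration: number of iterations completed so far.
    prev_ll, curr_ll: log-likelihood after the previous and current iteration
                      (prev_ll is None before the second iteration).
    tolerance: 0 disables the convergence check, leaving only the iteration cap.
    """
    if iteration >= max_iterations:
        return True
    if tolerance > 0 and prev_ll is not None:
        # Assumed form of the test: absolute change in log-likelihood.
        return abs(curr_ll - prev_ll) < tolerance
    return False
```

With a tolerance of zero, the function returns True only once the iteration cap is reached, which matches the behavior described for the Convergence Tolerance parameter.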

Output

Visual Output

Output is displayed in a single tab. For further output and assessment of the quality of the Logistic Regression model, add ROC and Lift operators, in addition to the required Logistic Regression Prediction operator.

The Logistic Regression (MADlib) operator output includes the coefficients (beta) of the model, the Odds Ratio, the standard error (SE), the Z-value, and the P-value statistics.
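
The statistics listed above are related to one another in the standard way for a Wald test. The sketch below shows those relationships for a single coefficient, assuming a two-sided test against the standard normal distribution; the function name is hypothetical, and the operator may differ in its exact numerical procedure.

```python
import math


def wald_statistics(coef, std_err):
    """Derive the odds ratio, Z-value, and two-sided P-value for one coefficient."""
    odds_ratio = math.exp(coef)   # multiplicative change in odds per unit increase
    z_value = coef / std_err      # Wald z-statistic
    # Two-sided tail probability of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z_value) / math.sqrt(2.0))
    return odds_ratio, z_value, p_value
```

A coefficient of zero gives an odds ratio of 1, a Z-value of 0, and a P-value of 1, meaning the variable shows no evidence of an effect on the outcome.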

Data Output

None. This is a terminal operator.
