Logistic Regression - MADlib
The binomial Logistic Regression (MADlib) operator models the relationship between a dichotomous dependent variable and one or more predictor variables.
Algorithm
- The dependent variable is a Boolean value that can be represented with a Boolean expression.
- (Binomial) logistic regression refers to a stochastic model in which the conditional mean of the dependent dichotomous variable is the logistic function of an affine function of the vector of the independent variables.
- Logistic regression finds the vector of coefficients that maximizes the likelihood of the observations.
- Currently, logistic regression in MADlib can use one of the following three algorithms:
  - Iteratively Reweighted Least Squares.
  - A conjugate-gradient approach, also known as the Fletcher-Reeves method in the literature, where the Hestenes-Stiefel rule is used to calculate the step size.
  - Incremental gradient descent, also known as incremental gradient methods or stochastic gradient descent in the literature.
See the official MADlib documentation for more information.
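As a rough illustration of the third algorithm, the following is a minimal pure-Python sketch of incremental gradient descent for a binomial logistic model. It is not MADlib's implementation: the function names, the fixed step size, the per-pass shuffling, and the convention that the first feature is a constant 1 (the intercept) are all assumptions made for this example. The stopping rule mirrors the one described for the Maximum Iterations and Convergence Tolerance parameters.

```python
import math
import random


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_likelihood(coef, rows, labels):
    """Sum of log P(y | x) over all observations."""
    total = 0.0
    for x, y in zip(rows, labels):
        t = sum(c * v for c, v in zip(coef, x))
        s = t if y else -t  # log P(y | x) = log(sigmoid(s))
        if s >= 0:
            total += -math.log1p(math.exp(-s))
        else:
            total += s - math.log1p(math.exp(s))
    return total


def fit_igd(rows, labels, max_iter=100, tolerance=1e-4, step=0.1, seed=0):
    """Fit coefficients by incremental (stochastic) gradient ascent.

    Stops after max_iter passes, or earlier when the log-likelihood
    changes by less than tolerance between passes. A tolerance of 0
    disables the early stop.
    """
    rng = random.Random(seed)
    coef = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    previous = log_likelihood(coef, rows, labels)
    current, iteration = previous, 0
    for iteration in range(1, max_iter + 1):
        rng.shuffle(order)
        for i in order:
            x, y = rows[i], labels[i]
            p = sigmoid(sum(c * v for c, v in zip(coef, x)))
            error = (1.0 if y else 0.0) - p
            # Gradient of one observation's log-likelihood is error * x.
            coef = [c + step * error * v for c, v in zip(coef, x)]
        current = log_likelihood(coef, rows, labels)
        if tolerance > 0 and abs(current - previous) < tolerance:
            break
        previous = current
    return coef, current, iteration


# Toy data: first feature is the constant 1 (intercept), second is x.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
rows = [[1.0, x] for x in xs]
labels = [False, False, False, True, False, True, True, True]
coef, ll, n_iter = fit_igd(rows, labels)
```

The returned log-likelihood can be compared with the value at the all-zero starting point to confirm that training improved the fit.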
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
MADlib Schema Name | Schema where MADlib is installed in the database. MADlib must be installed in the same database as the input data set. If a "madlib" schema exists in the database, this parameter defaults to madlib. |
Model Output Schema Name | Name of the schema where the output is stored. |
Model Output Table | Name of the table that is created to store the regression model. Specifically, the model output table stores the columns [group_col_1, group_col_2, ...,] coef, log_likelihood, std_err, z_stats, p_values, odds_ratios, condition_no, and num_iterations. See the official MADlib logistic regression documentation for more information. |
Drop If Exists | |
Dependent Variable | The quantity to model or predict. Must be a Boolean value. From the list of available data columns, select the column to use as the dependent variable for the regression. |
Independent Variables | Specifies the independent variable data columns to include for the regression analysis or model training. You must specify at least one column. Click Select Columns to open the dialog box for selecting the available columns from the input data set for analysis. |
Grouping Columns | Specifies at least one column to group the input data and build separate regression models for each group. Click Select Columns to open the dialog box for selecting the available columns from the input data set for grouping. |
Maximum Iterations | The maximum number of iterations allowed. The computation stops when the number of iterations exceeds Maximum Iterations or when the difference between log-likelihood values in successive iterations falls below the Convergence Tolerance. |
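Optimizer | The algorithm used to compute the model. Can be one of the three algorithms listed in the Algorithm section: Iteratively Reweighted Least Squares, conjugate gradient, or incremental gradient descent. |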
Convergence Tolerance | The difference between log-likelihood values in successive iterations that indicates convergence. A value of zero disables the convergence criterion, so that execution stops only after the number of iterations set in Maximum Iterations is complete. |
Verbosity | Set to true (the default) to log all SQL console output of the results of training. |
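
To make the parameters above more concrete, the sketch below assembles a call to MADlib's `logregr_train` function from equivalent settings. How the operator builds its SQL internally is not documented here, so the mapping is an assumption; the helper name `build_logregr_train_sql`, the table and column names, the default values, and the leading `1` (an intercept term) in the feature array are illustrative only. The helper does no identifier quoting or escaping, so it is not suitable for untrusted input.

```python
def build_logregr_train_sql(source_table, out_table, dependent_var,
                            independent_vars, grouping_cols=None,
                            max_iter=20, optimizer="irls",
                            tolerance=0.0001, verbose=True,
                            madlib_schema="madlib"):
    """Return a SQL string that calls <schema>.logregr_train (hypothetical helper)."""
    if not independent_vars:
        raise ValueError("at least one independent variable is required")
    if optimizer not in ("irls", "cg", "igd"):
        raise ValueError(f"unknown optimizer: {optimizer!r}")

    # Assumption: a leading constant 1 adds an intercept to the model.
    features = "ARRAY[1, " + ", ".join(independent_vars) + "]"
    # Grouping columns are passed as one comma-separated string, or NULL.
    grouping = ("'" + ", ".join(grouping_cols) + "'") if grouping_cols else "NULL"

    return (
        f"SELECT {madlib_schema}.logregr_train(\n"
        f"    '{source_table}',\n"          # input data set
        f"    '{out_table}',\n"             # Model Output Schema Name + Table
        f"    '{dependent_var}',\n"         # Dependent Variable
        f"    '{features}',\n"              # Independent Variables
        f"    {grouping},\n"                # Grouping Columns
        f"    {int(max_iter)},\n"           # Maximum Iterations
        f"    '{optimizer}',\n"             # Optimizer
        f"    {float(tolerance)},\n"        # Convergence Tolerance
        f"    {str(bool(verbose)).upper()}\n"  # Verbosity
        ");"
    )


# Hypothetical table and column names.
sql = build_logregr_train_sql(
    "public.customers", "public.churn_model", "churned",
    ["age", "income"], grouping_cols=["region"],
    max_iter=50, optimizer="cg", tolerance=0.001, verbose=False,
)
print(sql)
```

The optimizer codes `irls`, `cg`, and `igd` are the values MADlib accepts for the three algorithms; the labels shown in the operator's dialog may differ.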
Output
- Visual Output
- Output is displayed in a single tab. For further output and assessment of the quality of the Logistic Regression model, add ROC and Lift operators, in addition to the required Logistic Regression Prediction operator.
The Logistic Regression (MADlib) operator output includes the coefficients (beta) of the model, the Odds Ratio, the standard error (SE), the Z-value, and the P-value statistics.
- Data Output
- None. This is a terminal operator.
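
The statistics reported in the visual output are related to one another in a simple way. The sketch below derives the odds ratio, Z-value, and P-value from a coefficient and its standard error, assuming the usual Wald test with a two-sided P-value from the standard normal distribution. The function name `wald_statistics` and the sample numbers are hypothetical, and the dictionary keys reuse the column names of the model output table only for readability.

```python
import math


def wald_statistics(coef, std_err):
    """Derive per-coefficient statistics from estimates and standard errors."""
    results = []
    for c, se in zip(coef, std_err):
        z = c / se
        # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
        p = math.erfc(abs(z) / math.sqrt(2.0))
        results.append({
            "coef": c,
            "std_err": se,
            "z_stats": z,
            "p_values": p,
            "odds_ratios": math.exp(c),
        })
    return results


# Hypothetical coefficients and standard errors for two predictors.
stats = wald_statistics([0.0, 1.96], [0.5, 1.0])
for row in stats:
    print(row)
```

An odds ratio above 1 means the odds of a true outcome rise as the predictor increases, and a small P-value suggests the coefficient differs from zero.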