Elastic Net Logistic

Information at a Glance

Category	Model
Data source type	DB
Sends output to other operators	Yes
Data processing tool	MADlib

Algorithm

This operator applies elastic net regularization to logistic regression problems. Elastic net regularization seeks a weight vector that, for a given training set, minimizes an objective function combining the logistic loss with the L1 penalty of lasso and the L2 penalty of ridge regression.
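
As a rough illustration of the objective described above, the following Python sketch computes the mean logistic loss plus the elastic net penalty, lambda * (alpha * L1 + (1 - alpha)/2 * L2). The function name and argument layout are hypothetical, and the scaling of the loss term and the unpenalized intercept are assumptions; the official MADlib documentation remains the reference for the exact form.

```python
import math

def elastic_net_logistic_objective(coef, intercept, rows, labels, alpha, lam):
    """Mean logistic loss plus elastic net penalty (illustrative sketch only)."""
    loss = 0.0
    for x, y in zip(rows, labels):
        z = intercept + sum(c * xi for c, xi in zip(coef, x))
        margin = z if y else -z  # Boolean label mapped to +z / -z
        # Numerically stable log(1 + exp(-margin))
        if margin >= 0:
            loss += math.log1p(math.exp(-margin))
        else:
            loss += -margin + math.log1p(math.exp(margin))
    l1 = sum(abs(c) for c in coef)   # lasso term
    l2 = sum(c * c for c in coef)    # ridge term
    # Assumption: the intercept is not penalized.
    penalty = lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)
    return loss / len(rows) + penalty
```

With alpha set to 1 only the L1 term contributes, and with alpha set to 0 only the L2 term contributes.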

For more information, including general principles, see the official MADlib documentation.

Input

A database data set that contains the dependent and independent variables for modeling. The data set must have at least one Boolean column and at least one numeric type column.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
MADlib Schema Name	The schema where MADlib is installed in the database. MADlib must be installed in the same database as the input data set. If a "madlib" schema exists in the database, this parameter defaults to madlib.
Model Output Schema Name	The name of the model output schema.
Model Output Table Name	The name of the table that is created to store the regression model. The model output table stores the following columns: `family | features | features_selected | coef_nonzero | coef_all | intercept | log_likelihood | standardize | iteration_run`. See the official MADlib elastic net regularization documentation for more information.
Drop If Exists	If Yes (the default), drop the existing table of the same name and create a new one. If No, stop the flow and alert the user that an error has occurred.
Dependent Variable	The quantity to model or predict. This must be a Boolean column. The list of the available data columns for the Elastic Net Logistic operator is displayed. Select the data column to be considered the dependent variable for the regression.
Independent Variables	Select the data columns to include for the regression analysis or model training. At least one column or one interaction must be specified. Click Select Columns to open the dialog box for selecting the available columns from the input data set for analysis. See: Select Columns Dialog Box.
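Control Parameter	The elastic net control parameter (alpha), which must be a value in [0,1]. A value of 1 gives pure L1 (lasso) regularization, and a value of 0 gives pure L2 (ridge) regularization.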
Regularization Parameter	The Regularization Parameter (lambda) must be a positive value.
Standardize	A Boolean flag that specifies the option to normalize the data. The default value is true, which often yields better results and faster convergence.
Optimizer	Can be fista (Fast Iterative Shrinkage-Thresholding Algorithm) or igd (Incremental Gradient Descent). Which of the optimizer-specific parameters that follow are required depends on the optimizer selected. See the official MADlib elastic net regularization documentation for more information.
FISTA Maximum Stepsize	The initial backtracking step size. At each iteration, the algorithm first tries `stepsize = max_stepsize`; if that step is not accepted, it tries a smaller step size, `stepsize = stepsize/eta`, where eta must be greater than 1. Although this can mean several trials within a single iteration, starting from a larger step size greatly increases the computation speed and reduces the total number of iterations. A careful choice of `max_stepsize` can decrease the computation time by more than a factor of 10. Default value: 4.0.
FISTA Eta	If the current `stepsize` is not accepted, `stepsize/eta` is tried. This value must be greater than 1. Default value: 2.0.
Warmup	If true, a strictly decreasing series of lambda values is used, ending at the lambda value that the user wants to calculate. A larger lambda gives a very sparse solution, and that sparse solution is used as the initial guess for the next lambda's solution, which speeds up the computation for the next lambda. For larger data sets, this can sometimes accelerate the whole computation and might be faster than computing on only one lambda value. If false (the default), this warmup procedure is not performed.
Warmup Lambdas	The lambda value series to use when warmup is true. The default is NULL, which means that lambda values are automatically generated.
Number of Warmup Lambdas	The number of lambdas used in warmup. If `warmup_lambdas` is not NULL, this value is overridden by the number of provided lambda values. Default value: 15.
Warmup Tolerance	The value of tolerance used during warmup. Default value: 1e-6.
FISTA Use Active Method	If true, an active-set method is used to speed up the computation. Considerable speedup is obtained by organizing the iterations around the active set of features, that is, those with nonzero coefficients. After a complete cycle through all the variables, the algorithm iterates on only the active set until convergence. If another complete cycle does not change the active set, the computation is finished; otherwise, the process is repeated. If false (the default), the active-set method is not used.
FISTA Active Tolerance	The value of tolerance used during active set calculation. Default value: 1e-6.
FISTA Random Step Size	Specify whether to add some randomness to the step size. Sometimes, this can speed up the calculation. Default value: false.
IGD Step Size	The initial backtracking step size. Default value: 0.01.
IGD Zero Coefficient Threshold	When a coefficient is very small, set this coefficient to 0. Because IGD is a stochastic gradient descent (SGD) method, coefficients that should be zero typically come out as very small nonzero values. A threshold is therefore needed at the end of the computation to screen out tiny values and hard-set them to zero. This is accomplished as follows: (1) Multiply each coefficient by the standard deviation of the corresponding feature. (2) Compute the average of the absolute values of the rescaled coefficients. (3) Divide each rescaled coefficient by the average, and if the resulting absolute value is smaller than the threshold, set the original coefficient to zero. Default value: 1e-10.
IGD Parallelize	Specify whether to run the computation on multiple segments. SGD is sequential by nature. When it runs in a distributed manner, each data segment fits its own SGD model, and the models are averaged at each iteration. This averaging might slow convergence, but it makes it possible to process large data sets on multiple machines. The parallel option lets you choose whether to compute in parallel. Default value: true.
Maximum Iterations	The maximum number of iterations. The computation stops when the difference between the coefficients of two consecutive iterations is smaller than the Convergence Tolerance or when the iteration count exceeds this value. Default value: 10000.
Convergence Tolerance	The convergence threshold. The computation stops when the difference between the coefficients of two consecutive iterations is smaller than this value or when the iteration count exceeds Maximum Iterations. Default value: 1e-6.
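
The thresholding steps described for IGD Zero Coefficient Threshold can be sketched in Python as follows. The function name and the handling of the all-zero case are assumptions made for illustration; this is not MADlib's actual code.

```python
def zero_small_coefficients(coef, feature_std, threshold=1e-10):
    """Hard-set relatively tiny coefficients to zero (illustrative sketch)."""
    # Step 1: rescale each coefficient by its feature's standard deviation.
    rescaled = [c * s for c, s in zip(coef, feature_std)]
    # Step 2: average of the absolute rescaled coefficients.
    mean_abs = sum(abs(r) for r in rescaled) / len(rescaled)
    if mean_abs == 0.0:
        # Assumption: with nothing to compare against, leave values unchanged.
        return list(coef)
    # Step 3: zero the original coefficient when its relative size is below threshold.
    return [0.0 if abs(r) / mean_abs < threshold else c
            for c, r in zip(coef, rescaled)]
```

For example, with coefficients `[2.0, 1e-13, -4.0]` and standard deviations `[1.0, 1.0, 0.5]`, only the second coefficient falls below the default threshold and is set to zero.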

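The backtracking rule described for FISTA Maximum Stepsize and FISTA Eta can be sketched as a single proximal gradient step. The acceptance test used here is the standard quadratic upper-bound condition from the FISTA literature; whether MADlib uses exactly this condition is an assumption, and the function names are hypothetical.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * |v| (shrinks v toward zero by t)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def backtracking_step(f, grad, w, l1_weight, max_stepsize=4.0, eta=2.0):
    """Try max_stepsize first, then divide by eta until the step is accepted."""
    if eta <= 1.0:
        raise ValueError("eta must be greater than 1")
    g = grad(w)
    fw = f(w)
    stepsize = max_stepsize
    while True:
        # Gradient step on the smooth part, then shrinkage for the L1 part.
        cand = [soft_threshold(wi - stepsize * gi, stepsize * l1_weight)
                for wi, gi in zip(w, g)]
        diff = [c - wi for c, wi in zip(cand, w)]
        # Quadratic upper bound on the smooth loss around w.
        bound = (fw + sum(gi * di for gi, di in zip(g, diff))
                 + sum(d * d for d in diff) / (2.0 * stepsize))
        if f(cand) <= bound:
            return cand, stepsize
        stepsize /= eta  # step rejected: try a smaller one
```

For the smooth loss f(w) = 5w² starting from w = 1.0 with no L1 term, the defaults reduce the step size from 4.0 to 0.0625 before a step is accepted.
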
Output

Visual Output

Output is displayed on the Summary tab, which displays the features selected, non-zero coefficients, all coefficients, log likelihood, and number of iterations run until termination. Additional assessment is often done using ROC, LIFT, and Logistic Regression Prediction Operator results.

See the official MADlib documentation for elastic net regularization for more information.
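
For orientation, the configuration parameters map roughly onto the arguments of MADlib's `elastic_net_train` function, which writes a result table with the columns summarized above. The Python sketch below only builds the SQL text. Whether the operator issues exactly this call is an assumption, the helper name and the table and column names in the usage example are hypothetical, and no quoting or escaping of identifiers is attempted.

```python
def build_elastic_net_call(schema, source, result, dep_var, ind_vars,
                           alpha, lam, standardize=True, optimizer="fista",
                           optimizer_params="", max_iter=10000, tolerance=1e-6):
    """Return SQL text for a MADlib elastic_net_train call (illustrative sketch)."""
    ind_expr = "ARRAY[" + ", ".join(ind_vars) + "]"
    return (
        f"SELECT {schema}.elastic_net_train("
        f"'{source}', '{result}', '{dep_var}', '{ind_expr}', "
        f"'binomial', {alpha}, {lam}, {str(standardize).upper()}, "
        # grouping_col and excluded are left as NULL in this sketch.
        f"NULL, '{optimizer}', '{optimizer_params}', NULL, "
        f"{max_iter}, {tolerance});"
    )

# Hypothetical usage: table and column names are placeholders.
sql = build_elastic_net_call(
    "madlib", "public.train", "public.model", "churned",
    ["age", "income"], alpha=0.5, lam=0.1,
    optimizer_params="max_stepsize=4.0, eta=2.0",
)
```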

Data Output

None.
