Elastic Net Logistic Regression

The Elastic-Net Logistic Regression operator applies the elastic-net logistic regression algorithm to the input data set. This operator supports the open source implementation of the elastic-net regularized logistic regression algorithm.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter	Description
Category	Model
Data source type	TIBCO® Data Virtualization
Send output to other operators	Yes
Data processing tool	TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The Elastic-Net Logistic Regression operator fits an S-curve logistic function (the logit model) to a data set to calculate the probability that a specific categorical event occurs, based on the values of a set of independent variables. The operator uses the logistic regression implementation in Spark 3.2.0.
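
As an illustration of the S-curve described above, the following minimal Python sketch maps a linear combination of predictor values to a probability. The function names `logistic` and `predict_probability` are hypothetical and are not part of the operator or of Spark; the sketch only shows the general form of a binary logistic model.

```python
import math


def logistic(z: float) -> float:
    """Map a linear score z to a probability in [0, 1] without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(intercept, coefficients, features):
    """Probability of the event for one row (hypothetical helper)."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return logistic(z)
```

A score of zero corresponds to a probability of 0.5; larger positive scores move the probability toward 1, and larger negative scores move it toward 0.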

Logistic regression analysis predicts the odds of the outcomes of a categorical variable based on one or more predictor variables. This operator implements the Spark MLlib open-source regularized logistic regression algorithm, optimized with L-BFGS, for classification problems. The operator tunes the hyper-parameters of logistic regression with cross-validation. The output is the Spark Logistic Regression Classification model with the best validation performance.
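
The selection procedure described above can be sketched in plain Python as follows. This is an illustration of cross-validated grid search in general, not the operator's actual code: the names `k_fold_indices`, `select_best`, and the `fit_and_score` callback are hypothetical, and the way rows are assigned to folds is an assumption.

```python
import itertools
import random
from statistics import mean


def k_fold_indices(n_rows, n_folds, seed):
    """Shuffle row indices with a seed and deal them into n_folds folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def select_best(lambdas, alphas, n_rows, n_folds, seed, fit_and_score):
    """Return (best_params, best_score, summary) over the lambda x alpha grid.

    fit_and_score(lam, alpha, train_idx, valid_idx) is a hypothetical callback
    that trains on train_idx and returns a validation metric where higher is
    better (for example, accuracy).
    """
    folds = k_fold_indices(n_rows, n_folds, seed)
    summary = []
    for lam, alpha in itertools.product(lambdas, alphas):
        scores = []
        for k, valid_idx in enumerate(folds):
            # Every fold except fold k is used for training.
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(fit_and_score(lam, alpha, train_idx, valid_idx))
        summary.append(((lam, alpha), mean(scores)))
    best_params, best_score = max(summary, key=lambda item: item[1])
    return best_params, best_score, summary
```

Each entry in `summary` pairs one hyper-parameter combination with its mean validation score, and the combination with the highest mean is returned as the best one.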

Input

The input is a single tabular data set.

Bad or Missing Values

Null values are not allowed and result in an error.
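
Because null values cause an error, it can help to check the data before running the operator. The following minimal Python sketch assumes the rows are available as a list of dictionaries and treats only `None` (or a missing key) as null; the helper name `find_null_cells` is hypothetical and is not part of the product.

```python
def find_null_cells(rows, columns):
    """Return (row_index, column_name) pairs whose value is None or absent."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col in columns
        if row.get(col) is None
    ]
```

An empty result means that none of the checked columns contain nulls under this assumption.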

Configuration

The following table provides the configuration details for the Elastic-Net Logistic Regression operator.

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Dependent Variable	Specify the categorical data column to use as the dependent column. The column must be numerical; its values cannot be labels or class names.
Use all available columns as Predictors	When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, you must select at least one column in Continuous Predictors or Categorical Predictors.
Continuous Predictors	Specify the numerical data columns as independent columns. Each selected column must be numerical. Click Select Columns to select the required columns. Note: The columns selected in the Categorical Predictors parameter are not available.
Categorical Predictors	Specify the categorical data columns as independent columns. Note: The columns selected in the Continuous Predictors parameter are not available.
Normalize Numerical Features	Specify whether to normalize numerical features using Z-Transformation. Default: Yes
Evaluation Metric	The metric for evaluating model performance during cross-validation training. For more information, see the Spark documentation on multinomial logistic regression. The available values are Auto, FMeasure, and Accuracy. If you select Auto, the operator uses Accuracy for binary classification and FMeasure for multiclass classification. Note: The value of the beta parameter for FMeasure is set to 1. Default: Auto
Iterations	Specify the maximum number of iterations for each parameter combination in the grid. Default: 100
Tolerance	Specify the convergence tolerance. Default: 0.01
Penalizing Parameter (λ)	The λ parameter grid for Lasso Logistic Regression. For more information, see Multinomial logistic regression in the Apache Spark documentation. The valid value is a comma-separated sequence of values (such as `V1`, `V2`, and `V3`) representing `start`, `end`, and `count`. It is recommended that the values of λ span different orders of magnitude. In the case of `start:end:count`, the operator creates an exponential grid of `count` λ values from `start` to `end`. If `start` > `end`, then "Not valid, `start` value of λ is greater than the `end` value" is returned. If `count` is not an integer, then "Not valid; `count` should be an integer" is returned. If `count` < 2, then "Not valid; count should be at least 2" is returned. Default: 0.0, 0.5, 1.0
Elastic Parameter (α)	The elastic-net mixing parameter, which controls the balance between the L1 and L2 penalties. When α = 0, the penalty is an L2 penalty. When α = 1, the penalty is an L1 penalty. For more information, see Linear Methods - RDD-based API. The valid value is a comma-separated sequence of values (such as `V1`, `V2`, and `V3`) representing `start`, `end`, and `step`. If `start` > `end`, then "Not valid; start value of alpha is greater than the end value" is returned. If `step` > (`end` - `start`), then "Not valid; check the step value" is returned. Default: 0.0, 0.5, 1.0
Number of Cross Validation Folds	Specify the number of cross-validation folds. Default: 3
Random Seed	Specify the seed used for the pseudo-random row extraction. Default: 1
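
To illustrate the `start`, `end`, `count` and `start`, `end`, `step` forms described for the Penalizing Parameter (λ) and Elastic Parameter (α), the following Python sketch builds both grids and applies the documented validation checks. It rests on several assumptions that the table does not confirm: "exponential grid" is taken to mean geometric spacing (which requires `start` > 0), the α grid is taken to be linear, and the extra checks marked in the comments are additions of this sketch. The function names `lambda_grid` and `alpha_grid` are hypothetical.

```python
import math


def lambda_grid(start, end, count):
    """Geometrically spaced lambda values from start to end (inclusive)."""
    if start > end:
        raise ValueError("Not valid, start value of λ is greater than the end value")
    if int(count) != count:
        raise ValueError("Not valid; count should be an integer")
    count = int(count)
    if count < 2:
        raise ValueError("Not valid; count should be at least 2")
    if start <= 0:
        # Assumption of this sketch: geometric spacing needs a positive start.
        raise ValueError("geometric spacing requires start > 0")
    ratio = (end / start) ** (1.0 / (count - 1))
    return [start * ratio ** i for i in range(count)]


def alpha_grid(start, end, step):
    """Linearly spaced alpha values from start up to at most end."""
    if start > end:
        raise ValueError("Not valid; start value of alpha is greater than the end value")
    if step > (end - start):
        raise ValueError("Not valid; check the step value")
    if step <= 0:
        # Assumption of this sketch: a non-positive step is rejected.
        raise ValueError("step must be positive")
    n = int(math.floor((end - start) / step + 1e-9)) + 1
    return [round(start + i * step, 10) for i in range(n)]
```

For example, under these assumptions `lambda_grid(0.001, 1.0, 4)` yields values close to 0.001, 0.01, 0.1, and 1.0, which span several orders of magnitude as the recommendation for λ suggests.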

Output

Visual Output

Parameter Summary Info: Displays a list of the input parameters and their current settings.
Coefficients: For a multiclass target, displays the coefficients for each value to predict and the reference class. For a binary classification task, displays the coefficients for the value to predict (the non-reference class).
Training Summary: Displays a table with a row for each tested combination of hyper-parameters. For each combination, the chosen metric is displayed and the Best Model is marked.
Additional Model Info: Displays information about the levels within the dependent column and the reference categories of the logistic regression model.
Objective History: Displays the objective function history during training. In our implementation, the objective function is Log Loss (negative Log Likelihood). For more information, see Multinomial logistic regression in the Apache Spark documentation.
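
As a rough illustration of the objective named above, the following Python sketch computes the log loss (mean negative log-likelihood) for binary labels, together with an elastic-net penalty of the form λ(α‖w‖₁ + (1 − α)/2 ‖w‖₂²) used in the Spark linear methods documentation. Whether the values shown in Objective History include the penalty term is not stated here, so the two quantities are kept separate; the function names are hypothetical.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip probabilities so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def elastic_net_penalty(coefficients, lam, alpha):
    """lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)
```

With α = 1 the penalty reduces to the L1 term only, and with α = 0 it reduces to the L2 term only, matching the description of the Elastic Parameter (α).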