Elastic Net Logistic Regression

The Elastic-Net Logistic Regression operator applies the elastic-net logistic regression algorithm to the input data set. This operator supports the open source implementation of the elastic-net regularized logistic regression algorithm.

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Model
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

This Elastic-Net Logistic Regression operator fits an s-curve logistic or logit function to a data set to calculate the probability of the occurrence of a specific categorical event, based on the values of a set of independent variables. This operator implements the logistic regression in Spark 3.2.0.
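
As a minimal illustration of the s-curve described above (not the operator's internal code), the logistic function maps a linear combination of predictor values to a probability between 0 and 1. The function names, coefficients, and intercept below are hypothetical and chosen only for the example.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of the event for one row, given hypothetical model weights."""
    score = intercept + sum(w * x for w, x in zip(coefficients, features))
    return logistic(score)

# Hypothetical weights for two predictors; not taken from any fitted model.
p = predict_probability([1.0, -2.0], coefficients=[0.8, 0.3], intercept=0.1)
```

A score of 0 maps to a probability of exactly 0.5, and larger positive scores approach 1.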

Logistic regression analysis predicts the odds of the outcomes of a categorical variable based on one or more predictor variables. This operator implements the Spark MLlib open-source regularized logistic regression algorithm, optimized with L-BFGS, for classification problems. It tunes the hyper-parameters of logistic regression with cross-validation. The output is the Spark Logistic Regression Classification model with the best validation performance.
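
The hyper-parameter search described above can be sketched as follows. This is a simplified illustration, not the operator's implementation: select_best and toy_score are hypothetical names, and toy_score is a stand-in for training and validating a model on each cross-validation fold.

```python
from itertools import product

def select_best(lambdas, alphas, cv_score):
    """Score every (lambda, alpha) pair and return the best one.

    cv_score is a hypothetical callable returning the mean validation
    metric (higher is better) for one pair of hyper-parameters.
    """
    results = [(lam, alpha, cv_score(lam, alpha))
               for lam, alpha in product(lambdas, alphas)]
    best = max(results, key=lambda row: row[2])
    return best, results

# Stand-in scorer for illustration only; a real run would train and
# validate a model on each cross-validation fold.
def toy_score(lam, alpha):
    return 1.0 - abs(lam - 0.1) - 0.1 * abs(alpha - 0.5)

best, results = select_best([0.01, 0.1, 1.0], [0.0, 0.5, 1.0], toy_score)
```

Each entry in results corresponds to one tested combination, and best is the combination with the highest validation score.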

Input

The input is a single tabular data set.

Bad or Missing Values

Null values are not allowed and result in an error.

Configuration

The following table provides the configuration details for the Elastic-Net Logistic Regression operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Dependent Variable Specify the categorical data column as a dependent column. It must be numerical and the value cannot be a label or class.
Use all available columns as Predictors When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, you must select at least one column in Continuous Predictors or Categorical Predictors.
Continuous Predictors

Specify the numerical data columns as independent columns. Each selected column must be numerical. Click Select Columns to select the required columns.

Note: The columns selected in the Categorical Predictors parameter are not available.

Categorical Predictors

Specify the categorical data columns as independent columns.

Note: The columns selected in the Continuous Predictors parameter are not available.

Normalize Numerical Features

Specify whether to normalize numerical features using Z-Transformation.

Default: Yes
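
For reference, a Z-Transformation rescales a numerical column to zero mean and unit standard deviation. The sketch below is illustrative only. It assumes the population standard deviation, and this page does not state which variant the operator uses. The function name and sample values are hypothetical.

```python
import statistics

def z_transform(values):
    """Standardize a numeric column to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; an assumption of this sketch
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]

# Hypothetical sample values for one numerical column.
scaled = z_transform([64.0, 70.0, 85.0])
```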

Evaluation Metric

The metric for evaluating model performance during cross-validation training. For more information, see the Spark documentation on multinomial logistic regression.

The available values are:

  • Auto
  • FMeasure
  • Accuracy

If you select Auto, the operator uses Accuracy for binary classification and FMeasure for multiclass classification.

Note: The value of the beta parameter for FMeasure is set to 1.

Default: Auto
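
To make the two metrics concrete, the sketch below computes accuracy and the F-measure for a single class label with beta set to 1. It is an illustration under simple assumptions: the function names and sample labels are hypothetical, and how the operator averages the F-measure across classes is not covered here.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f_measure(y_true, y_pred, label, beta=1.0):
    """F-measure for one class label; beta=1 weighs precision and recall equally."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical labels, for illustration only.
y_true = ["yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "no", "yes", "yes"]
acc = accuracy(y_true, y_pred)
f1_yes = f_measure(y_true, y_pred, label="yes")
```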

Iterations

Specify the maximum number of iterations for each grid of parameters.

Default: 100

Tolerance

Specify the convergence tolerance.

Default: 0.01

Penalizing Parameter (λ)

The λ parameter grid for Lasso Logistic Regression. For more information, see Multinomial logistic regression in the Apache Spark documentation.

The valid value is a comma-separated sequence of values (such as V1, V2, and V3) representing start, end, and count. It is recommended that the values of λ span different orders of magnitude. In the case of start:end:count, the operator creates an exponential grid of count λ values from start to end.

  • If start > end, then "Not valid, start value of λ is greater than the end value" is returned.

  • If count is not an integer, then "Not valid; count should be an integer" is returned.

  • If count < 2, then "Not valid; count should be at least 2" is returned.

Default: 0.0, 0.5, 1.0
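
One way to read the start, end, and count rules above is sketched below. It assumes that "exponential grid" means geometric spacing, which requires start to be greater than 0, so it does not cover a start value of 0. The function name lambda_grid is hypothetical, and the final check on start is an addition of this sketch rather than documented operator behavior.

```python
def lambda_grid(start, end, count):
    """Return count values spaced geometrically from start to end.

    The first three error strings mirror the messages listed above.
    """
    if start > end:
        raise ValueError("Not valid, start value of λ is greater than the end value")
    if int(count) != count:
        raise ValueError("Not valid; count should be an integer")
    count = int(count)
    if count < 2:
        raise ValueError("Not valid; count should be at least 2")
    if start <= 0:
        # Assumption of this sketch: geometric spacing is undefined at 0.
        raise ValueError("this sketch needs start > 0 for geometric spacing")
    ratio = (end / start) ** (1.0 / (count - 1))
    return [start * ratio ** i for i in range(count)]

grid = lambda_grid(0.001, 1.0, 4)  # approximately 0.001, 0.01, 0.1, 1.0
```

A grid built this way spans several orders of magnitude, in line with the recommendation above.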

Elastic Parameter (α)

The α parameter grid that controls the elastic-net mixing between the L1 and L2 penalties.

  • When α = 0, then the penalty is an L2 penalty.

  • When α = 1, then the penalty is an L1 penalty.

For more information, see Linear Methods - RDD-based API.
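
The effect of α on the penalty can be sketched as follows, assuming the elastic-net formulation given in the Spark documentation, in which the penalty is λ multiplied by (α times the L1 norm plus (1 - α)/2 times the squared L2 norm). The function name and weight values are hypothetical.

```python
def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Hypothetical coefficient vector, for illustration only.
weights = [0.5, -1.0, 0.0]
ridge_only = elastic_net_penalty(weights, lam=0.1, alpha=0.0)  # pure L2 penalty
lasso_only = elastic_net_penalty(weights, lam=0.1, alpha=1.0)  # pure L1 penalty
mixed = elastic_net_penalty(weights, lam=0.1, alpha=0.5)       # blend of both
```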

The valid value is a comma-separated sequence of values, such as V1, V2, and V3, representing start, end, and step.

  • If start > end, then "Not valid; start value of alpha is greater than the end value" is returned.

  • If step > (end - start), then "Not valid; check the step value" is returned.

Default: 0.0, 0.5, 1.0
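
The start, end, and step rules above can be sketched as a linear grid. This is an illustration only: the function name alpha_grid is hypothetical, treating end as inclusive is an assumption, and the rejection of a non-positive step is an addition of this sketch.

```python
def alpha_grid(start, end, step):
    """Return evenly spaced values from start to end (inclusive) in increments of step."""
    if start > end:
        raise ValueError("Not valid; start value of alpha is greater than the end value")
    if step <= 0 or step > (end - start):
        # The step <= 0 check is an assumption added by this sketch.
        raise ValueError("Not valid; check the step value")
    # The small epsilon guards against floating-point error in the division.
    count = int((end - start) / step + 1e-9) + 1
    # Rounding hides artifacts such as 0.30000000000000004.
    return [round(start + i * step, 10) for i in range(count)]

grid = alpha_grid(0.0, 0.5, 0.1)
```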

Number of Cross Validation Folds

Specify the number of cross-validation folds.

Default: 3

Random Seed

Specify the seed used for the pseudo-random row extraction.

Default: 1
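
The interaction between the number of folds and the random seed can be illustrated with a seeded shuffle that deals rows into folds. This sketch is not the operator's actual row-extraction logic. The function name assign_folds is hypothetical, and the 14-row size is used only as an example.

```python
import random

def assign_folds(n_rows, n_folds=3, seed=1):
    """Shuffle row indices with a fixed seed and deal them into folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th index goes to the same fold.
    return [indices[i::n_folds] for i in range(n_folds)]

folds = assign_folds(14, n_folds=3, seed=1)
```

Calling the function again with the same seed reproduces the same folds, which is the purpose of fixing the seed.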

Output

Visual Output
  • Parameter Summary Info: Displays a list of the input parameters and their current settings.

  • Coefficients: For a multiclass target, displays the coefficients for each value to predict and the reference class. For a binary classification task, displays the coefficients for the value to predict (the non-reference class).

  • Training Summary: Displays a table with a row for each tested combination of hyper-parameters. For each combination, the chosen metric is displayed and the Best Model is marked.

  • Additional Model Info: Displays information about the levels within the dependent column and the reference categories of the logistic regression model.

  • Objective History: Displays the objective function history during training. In our implementation, the objective function is Log Loss (negative Log Likelihood). For more information, see Multinomial logistic regression in the Apache Spark documentation.
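
As a minimal sketch of the Log Loss mentioned under Objective History, the function below computes the mean negative log-likelihood of the true class. It leaves out any regularization term, and the function name and sample probabilities are hypothetical.

```python
import math

def log_loss(y_true, probabilities):
    """Mean negative log-likelihood of the true class.

    probabilities[i] maps each class label to its predicted probability.
    """
    eps = 1e-15  # guard against log(0)
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        total -= math.log(max(probs[label], eps))
    return total / len(y_true)

# Hypothetical predictions for two rows.
loss = log_loss(
    ["yes", "no"],
    [{"yes": 0.8, "no": 0.2}, {"yes": 0.4, "no": 0.6}],
)
```

Lower values indicate that the predicted probabilities agree more closely with the observed classes.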

Output to successive operators

A model object that can be used with a Predictor operator. Three columns are produced in the Predictor operator.

  • PRED_LOR: The value predicted by the classification model.
  • CONF_LOR: The probability of the predicted classification.
  • INFO_LOR: The probability of each class prediction.
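
The relationship between the three columns can be sketched as follows, assuming the model yields one probability per class. The function name is hypothetical, and the dictionary used for INFO_LOR is an assumption about representation, not the Predictor operator's actual output format.

```python
def prediction_columns(class_probabilities):
    """Derive the three output values from per-class probabilities for one row."""
    predicted = max(class_probabilities, key=class_probabilities.get)
    return {
        "PRED_LOR": predicted,                       # most probable class
        "CONF_LOR": class_probabilities[predicted],  # probability of that class
        "INFO_LOR": dict(class_probabilities),       # probability of every class
    }

# Hypothetical class probabilities for one row.
row = prediction_columns({"yes": 0.7, "no": 0.3})
```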

Example

The following example demonstrates the Elastic-Net Logistic Regression operator.

Workflow for Elastic Net Logistic Regression using TDV

Data

golf: This data set contains the following information:

  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.

Parameter Setting

The parameter settings for the golf data set are as follows:

  • Dependent Variable: play

  • Use all available columns as Predictors: No

  • Continuous Predictors: temperature, humidity

  • Categorical Predictors: wind

  • Normalize Numerical Features: Yes

  • Evaluation Metric: Auto

  • Iterations: 100

  • Tolerance: 0.01

  • Penalizing Parameter (λ): 0.0, 0.5, 0.2

  • Elastic Parameter (α): 0.0, 0.5, 0.1

  • Number of Cross Validation Folds: 3
  • Random Seed: 1

Results

The following figures display the results for the parameter settings for the golf data set.

Parameter Summary Info

Elastic_Net_Logistic_Regression_Parameter Summary Info

Coefficients

Elastic_Net_Logistic_Regression_Coefficients

Training Summary

Elastic_Net_Logistic_Regression_Training Summary

Additional Model Info

Elastic_Net_Logistic_Regression_Additional Model Info

Objective History

Elastic_Net_Logistic_Regression_Objective History