Elastic Net Logistic Regression
The Elastic-Net Logistic Regression operator applies the elastic-net logistic regression algorithm to the input data set. This operator supports the open source implementation of the elastic-net regularized logistic regression algorithm.
Information at a Glance
Parameter | Description |
---|---|
Category | Model |
Data source type | TIBCO® Data Virtualization |
Send output to other operators | Yes |
Data processing tool | TIBCO® DV, Apache Spark 3.2 or later |
Algorithm
The Elastic-Net Logistic Regression operator fits an s-curve logistic (logit) function to a data set to calculate the probability of the occurrence of a specific categorical event, based on the values of a set of independent variables. This operator uses the logistic regression implementation in Spark 3.2.0.
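As an illustration of the s-curve described above, the following minimal sketch computes a probability from a linear combination of predictor values. The function names, coefficients, and inputs are hypothetical and are not taken from the operator or from Spark.

```python
import math


def logistic(z):
    """Map any real number z onto the interval (0, 1) along an s-curve."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(intercept, coefficients, features):
    """Probability of the event for one row, given hypothetical fitted weights."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return logistic(z)


# Hypothetical weights and feature values, for illustration only.
p = predict_probability(0.0, [1.0, -1.0], [2.0, 2.0])
print(p)
```

A linear score of zero corresponds to a probability of 0.5; larger positive scores move the probability toward 1, and larger negative scores move it toward 0.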
Logistic regression analysis predicts the odds of the outcomes of a categorical variable based on one or more predictor variables. This operator implements the Spark MLlib open-source regularized logistic regression algorithm, optimized with L-BFGS for classification problems. The operator tunes the hyper-parameters of logistic regression with cross-validation. The output is the Spark Logistic Regression Classification model with the best validation performance.
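For orientation, the elastic-net regularized objective can be written as follows for the binary case. This follows the form given in the Spark MLlib documentation on linear methods; the label coding y_i in {-1, +1} and the omission of the intercept are simplifying assumptions for this sketch, not a statement of the operator's internals.

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \, w^{\top} x_i}\right)
  + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right),
\qquad \lambda \ge 0, \; \alpha \in [0, 1]
```

With α = 1 the penalty reduces to the L1 (Lasso) term, and with α = 0 it reduces to the L2 (ridge) term. λ scales the overall strength of the penalty.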
Input
An input is a single tabular data set.
Bad or Missing Values
Configuration
The following table provides the configuration details for the Elastic-Net Logistic Regression operator.
Parameter | Description |
---|---|
Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
Dependent Variable | Specify the categorical data column as a dependent column. It must be numerical and the value cannot be a label or class. |
Use all available columns as Predictors | When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, the user must select at least one of the Continuous or Categorical Predictors. |
Continuous Predictors | Specify the numerical data columns as independent columns. Each selected column must be numerical. Click Select Columns to select the required columns. Note: The columns selected in the Categorical Predictors parameter are not available. |
Categorical Predictors | Specify the categorical data columns as independent columns. Note: The columns selected in the Continuous Predictors parameter are not available. |
Normalize Numerical Features | Specify whether to normalize numerical features using Z-Transformation. Default: Yes |
Evaluation Metric | The metric for evaluating model performance during cross-validation training. For more information, see the Spark documentation on multinomial logistic regression. If you select Auto, the operator uses Accuracy for binary classification and FMeasure for multiclass classification. Note: The value of the beta parameter for FMeasure is set to 1. Default: Auto |
Iterations | Specify the maximum number of iterations for each combination of parameters in the grid. Default: 100 |
Tolerance | Specify the convergence tolerance. Default: 0.01 |
Penalizing Parameter (λ) | The λ parameter grid for Lasso Logistic Regression. For more information, see Multinomial logistic regression in the Apache Spark documentation. The valid value is a comma-separated sequence of values (such as V1, V2, and V3) representing start, end, and count. It is recommended that the values of lambda span different orders of magnitude. In the case of start:end:count, the operator creates an exponential grid of count lambda values from start to end. Default: 0.0, 0.5, 1.0 |
Elastic Parameter (α) | The grid of values for the ElasticNet mixing parameter α. For more information, see Linear Methods - RDD-based API. The valid value is a comma-separated sequence of values (such as V1, V2, and V3) representing start, end, and step. If start > end, then "Not valid; start value of alpha is greater than the end value" is returned. If step > (end - start), then "Not valid; check the step value" is returned. Default: 0.0, 0.5, 1.0 |
Number of Cross Validation Folds | Specify the number of cross-validation samples. Default: 3 |
Random Seed | Specify the seed used for the pseudo-random row extraction. Default: 1 |
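The grid parameters in the table above can be sketched in plain Python. This is a minimal illustration, not the operator's code: `alpha_grid` and `lambda_grid` are hypothetical helper names, the α grid is assumed to be linear with an inclusive end value, and the "exponential grid" for λ is assumed to mean geometric spacing between a positive start and end. Only the two α validation messages come from the table; the other checks are added assumptions.

```python
import math


def alpha_grid(start, end, step):
    """Linear grid of alpha values from start to end (end assumed inclusive)."""
    if start > end:
        raise ValueError("Not valid; start value of alpha is greater than the end value")
    if step > (end - start):
        raise ValueError("Not valid; check the step value")
    if step <= 0:
        # Assumption: the table does not say how a non-positive step is handled.
        raise ValueError("step must be positive")
    # Small tolerance guards against floating-point error in the division.
    n = int(math.floor((end - start) / step + 1e-9)) + 1
    return [round(start + i * step, 10) for i in range(n)]


def lambda_grid(start, end, count):
    """Geometric grid of `count` lambda values from start to end.

    Assumption: "exponential grid" means a constant ratio between neighbours,
    which requires start and end to be strictly positive.
    """
    if start <= 0 or end <= 0:
        raise ValueError("geometric spacing needs positive start and end")
    if count < 2:
        return [start]
    ratio = (end / start) ** (1.0 / (count - 1))
    return [start * ratio ** i for i in range(count)]


print(alpha_grid(0.0, 0.5, 0.1))
print(lambda_grid(0.001, 1.0, 4))
```

Under these assumptions, a λ grid from 0.001 to 1.0 with four values spans three orders of magnitude, which matches the recommendation that lambda values cover different orders of magnitude.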
Output
- Parameter Summary Info: Displays a list of the input parameters and their current settings.
- Coefficients: For a multiclass target, displays the coefficients for each value to predict and the reference class. For a binary classification task, displays the coefficients for the value to predict (the non-reference class).
- Training Summary: Displays a table with a row for each tested combination of hyper-parameters. For each combination, the chosen metric is displayed and the best model is marked.
- Additional Model Info: Displays information about the levels within the dependent column and the reference categories of the logistic regression model.
- Objective History: Displays the objective function history during training. In this implementation, the objective function is Log Loss (negative Log Likelihood). For more information, see Multinomial logistic regression in the Apache Spark documentation.
The operator also outputs a model object that can be used with a Predictor operator. The Predictor operator produces three columns:
- PRED_LOR: The value predicted by the classification model.
- CONF_LOR: The probability of the predicted classification.
- INFO_LOR: The probability of each class prediction.
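The relationship between the three columns can be sketched as follows. This is an illustration only: `predictor_columns` is a hypothetical helper, the class probabilities are made up, and the dictionary used for INFO_LOR is an assumed representation rather than the actual format of the column.

```python
def predictor_columns(class_probabilities):
    """Derive the three Predictor columns from per-class probabilities.

    class_probabilities: dict mapping each class label to its probability.
    """
    # The predicted class is the one with the highest probability.
    predicted = max(class_probabilities, key=class_probabilities.get)
    return {
        "PRED_LOR": predicted,                       # predicted class
        "CONF_LOR": class_probabilities[predicted],  # probability of that class
        "INFO_LOR": dict(class_probabilities),       # probability of every class
    }


# Made-up probabilities for a binary target.
row = predictor_columns({"yes": 0.7, "no": 0.3})
print(row["PRED_LOR"], row["CONF_LOR"])
```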
Example
The following example demonstrates the Elastic-Net Logistic Regression operator.
golf: This data set contains the following information:
- Five columns: outlook, temperature, wind, humidity, and play.
- 14 rows.
The parameter settings for the golf data set are as follows:
- Dependent Variable: play
- Use all available columns as Predictors: No
- Continuous Predictors: temperature, humidity
- Categorical Predictors: wind
- Normalize Numerical Features: Yes
- Evaluation Metric: Auto
- Iterations: 100
- Tolerance: 0.01
- Penalizing Parameter (λ): 0.0, 0.5, 0.2
- Elastic Parameter (α): 0.0, 0.5, 0.1
- Number of Cross Validation Folds: 3
- Random Seed: 1
The following figures display the results for the parameter settings for the golf data set.
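Because Normalize Numerical Features is set to Yes in this example, the continuous predictors are rescaled with a Z-Transformation before fitting. The following minimal sketch shows the transformation on made-up values (not the actual golf data). The use of the sample standard deviation is an assumption; the operator's exact convention is not stated here.

```python
import statistics


def z_transform(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample (n - 1) standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical temperature readings, for illustration only.
print(z_transform([1.0, 2.0, 3.0]))
```

Putting predictors on a common scale means the λ penalty treats their coefficients comparably, regardless of the original units.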