Elastic-Net Linear Regression
The Elastic-Net Linear Regression operator applies the elastic-net linear regression algorithm to the input data set. It uses the open-source Spark MLlib implementation of elastic-net regularized linear regression.
Information at a Glance

Parameter | Description
---|---
Category | Model
Data source type | TIBCO® Data Virtualization
Send output to other operators | Yes
Data processing tool | TIBCO® DV, Apache Spark 3.2 or later
Algorithm
The Elastic-Net Linear Regression operator fits a trend line to an observed data set in which one data value (the dependent variable) depends linearly on the values of the other causal data values (the independent variables). This operator implements elastic-net linear regression in Spark MLlib.
A penalizing parameter (lambda) and an elastic parameter (alpha) are applied to reduce the chance of overfitting the model. You can use this operator to find the best combination of alpha and lambda with a cross-validation method. This operator is limited by the cluster resources and the Spark data frame size. One-hot encoding of high-cardinality categorical columns is limited by the Spark cluster size.
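The behavior described above can be approximated with the public Spark MLlib (PySpark) API. The following is a minimal sketch, not the operator's actual internals; the `train` DataFrame and its `features` and `label` columns are assumptions for illustration.

```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# 'train' is an assumed DataFrame with a vector column "features"
# (the independent variables) and a numeric column "label" (the dependent variable).
lr = LinearRegression(featuresCol="features", labelCol="label",
                      maxIter=100, tol=0.01)

# Candidate values for the penalizing parameter (lambda -> regParam)
# and the elastic parameter (alpha -> elasticNetParam).
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.1, 0.5])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

evaluator = RegressionEvaluator(labelCol="label", metricName="rmse")
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, seed=1)

cv_model = cv.fit(train)                  # cross-validates every alpha/lambda pair
best = cv_model.bestModel                 # model with the best (lowest) RMSE
print(best.coefficients, best.intercept)
```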
Input
The input is a single tabular data set.
Bad or Missing Values
Configuration
The following table provides the configuration details for the Elastic-Net Linear Regression operator.
Parameter | Description
---|---
Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Dependent Variable | Specify the data column to use as the dependent variable. It must be numerical; the value cannot be a label or class.
Use all available columns as Predictors | When set to Yes, the operator uses all available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, you must select at least one Continuous or Categorical Predictor.
Continuous Predictors | Specify the numerical data columns to use as independent columns. Click Select Columns to select the required columns. Note: Columns selected in the Categorical Predictors parameter are not available.
Categorical Predictors | Specify the categorical data columns to use as independent columns. Note: Columns selected in the Continuous Predictors parameter are not available.
Normalize Numerical Features | Specify whether to normalize numerical features using Z-transformation. Default: Yes
Evaluation Metric | Specify the metric used to evaluate the regression models: MAE, MSE, R2, or RMSE. Default: RMSE
Iterations | Specify the maximum number of iterations for each grid of parameters. Default: 100
Tolerance | Specify the convergence tolerance. Default: 0.01
Penalizing Parameter (λ) | The grid of λ (regularization) values. For more information, see Classification and Regression in the Apache Spark documentation. Specify the grid as start: end: count(n); Team Studio creates an exponential grid of n λ values from start to end (see the grid-expansion sketch after this table). Values of λ should span different orders of magnitude. Default: 0.0, 0.5, 2
Elastic Parameter (α) | The grid of α values that controls the mix between the L1 (lasso) and L2 (ridge) penalties; α = 0 corresponds to ridge regression and α = 1 to lasso. For more information, see Linear Methods - RDD-based API in the Apache Spark documentation. Specify the grid as start, end, step. If start > end, then "Not valid; start value of alpha is greater than the end value" is returned. If step > (end - start), then "Not valid; check the step value" is returned. Default: 0.0, 0.5, 0.1
Number of Cross Validation Folds | Specify the number of cross-validation samples. Default: 3
Random Seed | Specify the seed used for the pseudo-random row extraction. Default: 1
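The exact way the operator expands the λ and α grids is not spelled out beyond the rules above, so the following Python sketch is only an interpretation of them: a log-spaced λ grid of n values (with an assumed small positive floor when start is 0) and an evenly stepped α grid with the two validity checks quoted in the table.

```python
import numpy as np

def lambda_grid(start, end, count):
    """Exponential (log-spaced) grid of `count` lambda values from start to end.
    A small positive floor is assumed when start is 0, since a log scale
    cannot include 0 exactly."""
    lo = max(start, 1e-4)
    return np.logspace(np.log10(lo), np.log10(end), num=count).tolist()

def alpha_grid(start, end, step):
    """Evenly stepped alpha values, with the validity checks described above."""
    if start > end:
        raise ValueError("Not valid; start value of alpha is greater than the end value")
    if step > (end - start):
        raise ValueError("Not valid; check the step value")
    return [round(v, 10) for v in np.arange(start, end + step / 2, step)]

print(lambda_grid(0.0, 0.5, 2))    # default lambda grid: 2 log-spaced values up to 0.5
print(alpha_grid(0.0, 0.5, 0.1))   # default alpha grid: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
```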
Output
- Parameter Summary Info: Displays a list of the input parameters and their current settings.
- Coefficients: Displays the coefficients of the model.
- Training Summary: Displays a table with a row for each tested combination of hyper-parameters. For each combination, the chosen evaluation metric is displayed and the Best Model is marked.
- Objective History: Displays a table showing the evolution of the objective function value during optimization. The objective function is the elastic-net objective function of linear regression with the selected alpha and lambda:
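In the standard elastic-net formulation used by Spark ML (reconstructed here for reference), with n training points (x_i, y_i), coefficient vector w, and intercept b, this objective is:

$$
\min_{w,\,b}\;\frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i + b - y_i\right)^2
\;+\;\lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$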
For Spark ML elastic-net linear regression, the default optimization method is L-BFGS, a numerical optimization algorithm. Training stops once the difference between two consecutive iterations is smaller than the user-specified tolerance.
When an exact solution of the objective function can be obtained, the normal equation method is used instead. For example, when lambda is 0, no regularization is applied and the elastic-net linear regression reduces to ordinary linear regression, for which an exact solution is obtained by the normal equation method. Similarly, when alpha is 0, the elastic-net loss reduces to the ridge loss, a convex function with a unique closed-form solution, so the normal equation method is applied in that case as well.
For more information, see Optimization of linear methods.
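For illustration only, the public PySpark API exposes both solvers; the parameter values below are arbitrary, and forcing a solver this way is not necessarily what the operator does internally.

```python
from pyspark.ml.regression import LinearRegression

# An iterative solver is required whenever the L1 term is active (alpha > 0 and lambda > 0).
lr_iterative = LinearRegression(regParam=0.5, elasticNetParam=0.5,
                                tol=0.01, solver="l-bfgs")

# The normal-equation solver applies when an exact solution exists,
# e.g. lambda = 0 (ordinary least squares) or alpha = 0 (ridge).
lr_ols = LinearRegression(regParam=0.0, solver="normal")
lr_ridge = LinearRegression(regParam=0.5, elasticNetParam=0.0, solver="normal")

# solver="auto" (the default) lets Spark choose between the two.
```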
Example
The following example demonstrates the Elastic-Net Linear Regression operator.
golf: This data set contains the following information:
- Multiple columns, namely outlook, temperature, wind, humidity, and play.
- 14 rows of data.
The parameter settings for the golf data set are as follows (a hedged PySpark sketch of these settings follows the list):
- Dependent Variable: temperature
- Use all available columns as Predictors: No
- Continuous Predictors: humidity
- Categorical Predictors: outlook, play
- Normalize Numerical Features: Yes
- Evaluation Metric: RMSE
- Iterations: 100
- Tolerance: 0.01
- Penalizing Parameter (λ): 0.0, 0.5, 2
- Elastic Parameter (α): 0.0, 0.5, 0.1
- Number of Cross Validation Folds: 3
- Random Seed: 1
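The example settings above roughly correspond to the following PySpark pipeline sketch. The `golf` DataFrame, the stage and column names, and the expanded λ/α grids are assumptions made for illustration, not the operator's actual implementation.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# 'golf' is an assumed DataFrame with columns outlook, temperature, wind, humidity, play.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in ["outlook", "play"]]
encoder = OneHotEncoder(inputCols=["outlook_idx", "play_idx"],
                        outputCols=["outlook_vec", "play_vec"])

# Z-transformation of the continuous predictor (Normalize Numerical Features: Yes).
num_vec = VectorAssembler(inputCols=["humidity"], outputCol="humidity_vec")
scaler = StandardScaler(inputCol="humidity_vec", outputCol="humidity_std",
                        withMean=True, withStd=True)

assembler = VectorAssembler(inputCols=["humidity_std", "outlook_vec", "play_vec"],
                            outputCol="features")

lr = LinearRegression(labelCol="temperature", featuresCol="features",
                      maxIter=100, tol=0.01)
pipeline = Pipeline(stages=indexers + [encoder, num_vec, scaler, assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.5])                             # lambda grid: 2 values, 0.0 to 0.5 (expansion assumed)
        .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])  # alpha grid: 0.0 to 0.5, step 0.1
        .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="temperature", metricName="rmse"),
                    numFolds=3, seed=1)
cv_model = cv.fit(golf)

best_lr = cv_model.bestModel.stages[-1]   # the fitted LinearRegressionModel
print(best_lr.coefficients, best_lr.intercept)
```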
The following figures display the results for the parameter settings for the golf data set.