Generalized Linear Regression Models

Fits a regression model to predict a dependent variable that follows a distribution from the exponential family.


Information at a Glance

Category Model
Data source type HD
Sends output to other operators Yes
Data processing tool Spark

For example, if you have the National Transportation Safety Board's data set of the number of auto accidents by state in a year, you could use the Poisson distribution to fit a model that predicts future accident counts from the predictor variables available in the data set. Team Studio uses the MLlib implementation of generalized linear regression, so you need Spark version 2.0 or later.

You can connect this operator to the Predictor Operator to obtain predictions on new data.
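Obtaining a prediction from a fitted generalized linear model means applying the inverse of the link function to the linear predictor. The sketch below, in plain Python rather than Spark, writes out the standard mathematical definitions of the link functions listed under Configuration. The names `LINKS` and `predict_mean` are hypothetical and are not part of Team Studio or MLlib.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Each entry maps a link name to (link g, inverse link g^-1), where
# eta = g(mu) is the linear predictor and mu = g^-1(eta) is the mean.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "sqrt": (math.sqrt, lambda eta: eta * eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1.0 - mu)),
                lambda eta: 1.0 - math.exp(-math.exp(eta))),
}


def predict_mean(link_name, eta):
    """Convert a linear predictor value into an expected response."""
    _, inverse = LINKS[link_name]
    return inverse(eta)
```

For instance, `predict_mean("log", eta)` returns `exp(eta)`, which is why a log link always yields a positive predicted count.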

Input

A tabular input on Hadoop. The input should contain at least one numeric column that represents the dependent variable, and any number of columns that represent the independent variables. The operator one-hot encodes all of the string columns selected as independent variables.
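As a rough illustration of what one-hot encoding does to a string column, the following sketch expands a categorical column into 0/1 indicator columns. The function name `one_hot_encode` and the sample rows are hypothetical. This is plain Python, not the operator's Spark implementation, which may differ in details such as category ordering or whether a reference category is dropped.

```python
def one_hot_encode(rows, column):
    """Replace a string column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the string column being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data.
rows = [
    {"state": "CA", "accidents": 120},
    {"state": "NY", "accidents": 95},
    {"state": "CA", "accidents": 130},
]
encoded_rows = one_hot_encode(rows, "state")
```

Each distinct string value becomes its own feature, so a high-cardinality string column can add many features to the model.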

Bad or Missing Values
A training example (row) is dropped from the data set if the dependent variable or any of the predictors is missing.
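The row-dropping rule can be sketched as a simple filter. Exactly what the operator counts as bad or missing is not specified here, so the `is_missing` check below (None, empty strings, and NaN) is an assumption, and both function names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None, empty strings, and NaN as missing (an assumption)."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, float):
        return math.isnan(value)
    return False


def drop_incomplete(rows, columns):
    """Keep only rows where every listed column has a value."""
    return [row for row in rows
            if not any(is_missing(row.get(c)) for c in columns)]


# Hypothetical sample data: the second and third rows are incomplete.
rows = [
    {"miles": 10.5, "drivers": 3, "accidents": 2},
    {"miles": None, "drivers": 4, "accidents": 1},
    {"miles": 8.0, "drivers": 2, "accidents": float("nan")},
]
clean_rows = drop_incomplete(rows, ["miles", "drivers", "accidents"])
```

Because a single missing value removes the whole row, it is worth checking how many rows remain before interpreting the fit.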

Restrictions

To run binomial regression on a dependent column of string type, you must first string-index the column to produce a numeric dependent column.
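String indexing replaces each distinct label with a number. The sketch below shows the idea in plain Python; the function name `string_index` is hypothetical, and the ordering rule (most frequent label first, ties broken alphabetically) is an assumption, so check the mapping your indexing step actually produces.

```python
from collections import Counter


def string_index(values):
    """Map each distinct label to 0.0, 1.0, ... by descending frequency."""
    counts = Counter(values)
    # Most common label first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Hypothetical two-class dependent column.
labels = ["no", "yes", "no", "no", "yes"]
indexed, mapping = string_index(labels)
```

For a binomial model the mapping matters: the label that receives 1.0 is the outcome whose probability the model estimates.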

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Distribution Family Distribution of the dependent variable.
  • gaussian (the default)
  • binomial
  • poisson
  • gamma
Link Function The link function that defines the relationship between the expected value of the dependent variable and the linear predictor.
  • cloglog
  • identity
  • inverse
  • log (the default)
  • logit
  • probit
  • sqrt
Dependent Column A numeric column to use as the output.
Independent Columns One or more columns to serve as input.

Spark currently supports up to 4096 features.

Max. Iterations Maximum number of iterations the IRLS (iteratively reweighted least squares) solver performs.

Default value: 100.

Convergence Tolerance The integer value in this field is used as the (negative) exponent of 10 (for example, 4 evaluates to 10^-4, that is, 1E-4) to obtain the tolerance that is used to check for convergence of the IRLS procedure.
Regularization Parameter A regularization parameter that penalizes large coefficients during optimization to reduce overfitting.

The default value of 0.0 indicates an unconstrained fit.

Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No lets you provide custom Spark optimization settings. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
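To show how Distribution Family, Link Function, Max. Iterations, and Convergence Tolerance interact, here is a minimal sketch of IRLS for a Poisson model with a log link and a single predictor. It is plain Python, not the MLlib solver: the function name, the starting values, the sample data, and the stopping rule (largest absolute coefficient change below the tolerance) are assumptions made for illustration, and regularization is omitted.

```python
import math


def fit_poisson_irls(x, y, max_iter=100, tol_exponent=4):
    """Fit log(E[y]) = b0 + b1 * x by iteratively reweighted least squares."""
    tol = 10.0 ** (-tol_exponent)  # exponent 4 -> 10**-4
    # Start from the intercept-only fit (assumes the mean of y is positive).
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for iteration in range(1, max_iter + 1):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Poisson with log link: weight = mu,
        # working response = eta + (y - mu) / mu.
        w = mu
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Solve the 2x2 weighted least-squares normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swz * swxx - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        change = max(abs(new_b0 - b0), abs(new_b1 - b1))
        b0, b1 = new_b0, new_b1
        if change < tol:
            return b0, b1, iteration, True
    return b0, b1, max_iter, False


# Hypothetical sample data: counts that grow roughly exponentially in x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 3, 4, 9, 20]
b0, b1, iterations, converged = fit_poisson_irls(x, y)
```

The function returns the intercept, the slope, the number of iterations used, and whether the tolerance was met before the iteration limit. A larger tolerance exponent demands smaller coefficient changes and can therefore require more iterations.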

Output

Visual Output
The visual output includes Summary, Goodness of Fit, and Parameter Estimates tables.
Parameter Estimates
The following image shows the Parameter Estimates table and the associated fit statistics, with t representing Student's t statistic and p representing the corresponding p-value.

Goodness of Fit
The following image shows the Goodness of Fit table.