Generalized Linear Regression Models

Information at a Glance

Category	Model
Data source type	HD
Sends output to other operators	Yes
Data processing tool	Spark

Fits a generalized linear model, in which the dependent variable follows a distribution from a chosen family and is related to the linear predictor through a link function. For example, if you have the National Transportation Safety Board's data set of the number of auto accidents by state in a year, you could use the Poisson distribution to fit a model that predicts future accident counts from the predictor variables available in the data set. Team Studio uses the Spark MLlib implementation of generalized linear regression, so Spark version 2.0 or later is required.
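The accident-count example can be sketched in a few lines of plain Python. This is a minimal illustration of Poisson regression with a log link and a single predictor, fitted by iteratively reweighted least squares (IRLS); it is not the Spark MLlib implementation the operator uses. The function name fit_poisson_irls, the variable names, and the data values are all hypothetical.

```python
import math


def fit_poisson_irls(xs, ys, max_iter=100, tol=1e-8):
    """Fit log(E[y]) = b0 + b1 * x by iteratively reweighted least squares."""
    # Start from a constant model: intercept = log(mean count), slope = 0.
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(max_iter):
        # Accumulate the terms of the weighted normal equations.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x            # linear predictor
            mu = math.exp(eta)           # inverse log link
            w = mu                       # IRLS weight for Poisson with log link
            z = eta + (y - mu) / mu      # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted least-squares system in closed form.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical data: one predictor value and one accident count per state.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
accidents = [2, 3, 6, 7, 8, 9]

intercept, slope = fit_poisson_irls(predictor, accidents)
predicted = math.exp(intercept + slope * 7.0)  # expected count at x = 7
print(f"intercept={intercept:.3f}, slope={slope:.3f}, prediction={predicted:.1f}")
```

Because the log link is used, predictions are obtained by exponentiating the linear predictor, so they are always positive, which suits count data.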

You can connect this operator to the Predictor Operator to obtain predictions on new data.

Input

A tabular input on Hadoop. The input should contain at least one numeric column that represents the dependent variable, and any number of columns that represent the independent variables. The operator one-hot encodes all of the string columns selected as independent variables.
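To illustrate what one-hot encoding does to a string column, here is a minimal sketch in plain Python. It emits one 0/1 indicator column per category; whether the operator drops a reference category is not stated here, so treat that detail as an assumption. The function name one_hot_encode and the sample rows are hypothetical.

```python
def one_hot_encode(rows, column):
    """Replace a string column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical input rows with one string predictor and a numeric dependent column.
rows = [
    {"region": "west", "accidents": 12},
    {"region": "east", "accidents": 7},
    {"region": "west", "accidents": 9},
]
encoded_rows = one_hot_encode(rows, "region")
print(encoded_rows[0])
```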

Bad or Missing Values: The training example is dropped from the data set if any of the predictors or the dependent variable is missing.
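The row-dropping rule can be sketched as follows. This is an illustration in plain Python, assuming that a missing value is represented as None or a floating-point NaN; the helper names and sample rows are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and floating-point NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_rows(rows, columns):
    """Keep only rows in which every listed column has a value."""
    return [
        row for row in rows
        if not any(is_missing(row.get(column)) for column in columns)
    ]


# Hypothetical rows: "accidents" is the dependent variable, the rest are predictors.
rows = [
    {"accidents": 12, "population": 3.1, "region": "west"},
    {"accidents": None, "population": 1.4, "region": "east"},
    {"accidents": 9, "population": float("nan"), "region": "west"},
]
complete_rows = drop_incomplete_rows(rows, ["accidents", "population", "region"])
print(len(complete_rows))
```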

Restrictions

To run binomial regression on a dependent column of string type, you must first string index the column to produce a numeric dependent column.
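String indexing replaces each distinct label with a number. The sketch below shows the idea in plain Python; assigning index 0 to the most frequent label, with ties broken alphabetically, is an assumption of this example, and for a two-class column any consistent mapping to 0 and 1 serves the same purpose. The function name string_index and the labels are hypothetical.

```python
from collections import Counter


def string_index(values):
    """Map string labels to numeric indices, most frequent label first."""
    counts = Counter(values)
    # Order by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(index) for index, label in enumerate(ordered)}
    return [mapping[value] for value in values], mapping


# Hypothetical string dependent column for a binomial model.
labels = ["yes", "no", "no", "yes", "no"]
indexed, mapping = string_index(labels)
print(mapping)
print(indexed)
```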

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Distribution Family	Distribution of the dependent variable. Options: gaussian (the default), binomial, poisson, gamma.
Link Function	The link function that defines the relationship between the expected value of the dependent variable and the linear predictor. Options: cloglog, identity, inverse, log (the default), logit, probit, sqrt.
Dependent Column	A numeric column to use as the output.
Independent Columns	One or more columns to serve as input. Spark currently supports up to 4096 features.
Max. Iterations	Maximum number of iterations the IRLS (iteratively reweighted least squares) solver performs. Default value: 100.
Convergence Tolerance	The integer value in this field is used as the (negative) exponent of a power of 10 (for example, 4 evaluates to 10^-4, that is, 1E-4), which is the tolerance used to check for convergence of the IRLS procedure.
Regularization Parameter	A regularization parameter that constrains the optimization to reduce overfitting. The default value of 0.0 indicates an unconstrained fit.
Advanced Spark Settings Automatic Optimization	Yes specifies using the default Spark optimization settings. No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
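Two of the settings above can be made concrete with a short sketch in plain Python: how the Convergence Tolerance integer translates into a numeric tolerance, and how each link function maps the linear predictor back to the expected value of the dependent variable (the inverse link). The names tolerance_from_exponent and INVERSE_LINKS are hypothetical and are not part of the operator or of Spark.

```python
import math
from statistics import NormalDist


def tolerance_from_exponent(exponent):
    """Convert the integer setting to a tolerance, for example 4 -> 10**-4."""
    return 10.0 ** (-exponent)


# Inverse link functions: map the linear predictor eta to the mean mu.
INVERSE_LINKS = {
    "identity": lambda eta: eta,
    "log": lambda eta: math.exp(eta),
    "inverse": lambda eta: 1.0 / eta,
    "sqrt": lambda eta: eta ** 2,
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "probit": lambda eta: NormalDist().cdf(eta),
    "cloglog": lambda eta: 1.0 - math.exp(-math.exp(eta)),
}

tolerance = tolerance_from_exponent(4)
print(tolerance)
for name, inverse_link in sorted(INVERSE_LINKS.items()):
    # Evaluate each inverse link at the same linear predictor value.
    print(name, inverse_link(0.5))
```

The logit, probit, and cloglog links keep the mean between 0 and 1, which is why they are commonly paired with the binomial family, while the log link keeps it positive.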

Output

Visual Output

The visual output includes Summary, Goodness of Fit, and Parameter Estimates tables.

Parameter Estimates: This table shows the parameter estimates and the associated fit statistics, with t representing the Student's t statistic and p representing the corresponding probability (p-value).
Goodness of Fit: This table shows the goodness-of-fit statistics for the model.
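As a rough guide to reading the t and p columns, the sketch below computes a t statistic as the estimate divided by its standard error and a two-sided p-value. It substitutes the standard normal distribution for Student's t, which is only a reasonable approximation when the residual degrees of freedom are large, so the p-values it gives will not exactly match those in the table. The function names and the numbers are hypothetical.

```python
from statistics import NormalDist


def t_statistic(estimate, std_error):
    """Ratio of a parameter estimate to its standard error."""
    return estimate / std_error


def approx_two_sided_p(t_value):
    """Two-sided p-value using a normal approximation to Student's t."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))


# Hypothetical estimate and standard error for one coefficient.
t_value = t_statistic(0.30, 0.10)
p_value = approx_two_sided_p(t_value)
print(f"t={t_value:.2f}, approximate p={p_value:.4f}")
```

A small p-value suggests that the corresponding coefficient differs from zero by more than sampling noise alone would explain.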
