Generalized Linear Regression Models
Fits a regression model to predict a dependent variable that follows a distribution from the exponential family.
Information at a Glance
For example, if you have the National Transportation Safety Board's data set of the number of auto accidents by state in a year, you could use the Poisson distribution to fit a model that predicts future accident counts based on the predictor variables available in the data set. Team Studio leverages the MLlib implementation of generalized linear regression, so you should have Spark version 2.0 or later.
You can connect this operator to the Predictor Operator to obtain predictions on new data.
Input
A tabular input on Hadoop. The input should contain at least one numeric column that represents the dependent variable, and any number of columns that represent the independent variables. The operator one-hot encodes all of the string columns selected as independent variables.
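As a rough illustration of what one-hot encoding does to a string column, the following Python sketch expands a hypothetical `state` column into one 0/1 indicator column per category. The function name, column names, and data are assumptions made for this example. The sketch keeps an indicator for every category; the operator's internal encoding may differ (for example, in whether a reference category is dropped).

```python
def one_hot_encode(rows, string_columns):
    """Replace each listed string column with 0/1 indicator columns."""
    # Collect the distinct categories per column, sorted for a stable layout.
    categories = {
        col: sorted({row[col] for row in rows}) for col in string_columns
    }
    encoded = []
    for row in rows:
        # Copy the non-string columns through unchanged.
        out = {k: v for k, v in row.items() if k not in string_columns}
        for col in string_columns:
            for cat in categories[col]:
                out[f"{col}={cat}"] = 1.0 if row[col] == cat else 0.0
        encoded.append(out)
    return encoded


# Hypothetical input: one string column and one numeric column.
rows = [
    {"state": "CA", "road_miles": 120.0},
    {"state": "TX", "road_miles": 95.0},
    {"state": "CA", "road_miles": 80.0},
]
encoded = one_hot_encode(rows, ["state"])
```

Each distinct string value becomes its own feature, so string columns with many distinct values can add a large number of features to the model.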
Restrictions
To run binomial regression on a dependent column of string type, you must first string-index the column to produce a numeric dependent column.
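String indexing maps each distinct string label to a numeric code. The sketch below is a plain-Python illustration, not the Team Studio implementation. It assumes that labels are ordered by descending frequency with ties broken alphabetically, and the `string_index` helper and sample labels are hypothetical.

```python
from collections import Counter


def string_index(values):
    """Map string labels to numeric codes (assumed: most frequent label gets 0.0)."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Hypothetical binary outcome stored as strings.
labels = ["yes", "no", "no", "yes", "no"]
indexed, mapping = string_index(labels)
```

For a two-class column such as this one, the result is a 0/1 numeric column that can serve as the dependent variable for a binomial family.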
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
Distribution Family | Distribution of the dependent variable. |
Link Function | The link function that defines the relationship between the expected value of the dependent variable and the linear predictor. |
Dependent Column | A numeric column to use as the output. |
Independent Columns | One or more columns to serve as input. Spark currently supports up to 4096 features. |
Max. Iterations | Maximum number of iterations the IRLS (iteratively reweighted least squares) solver should perform. Default value: 100. |
Convergence Tolerance | The integer value in this field is used as the negative exponent of 10 to set the tolerance that checks for convergence of the IRLS procedure (for example, 4 evaluates to 10^-4, that is, 1E-4). |
Regularization Parameter | A regularization parameter that constrains the optimization to reduce overfitting. The default value of 0.0 indicates an unconstrained fit. |
Advanced Spark Settings Automatic Optimization | |
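
To show how Max. Iterations, Convergence Tolerance, and Regularization Parameter interact, here is a minimal plain-Python sketch of IRLS for a Poisson family with a log link. It is not the MLlib implementation: the function names, the sample data, the starting values, the stopping rule (largest absolute coefficient change below 10^-k), and the ridge-style penalty that is also applied to the intercept are all assumptions chosen to keep the example short.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_poisson_irls(x_rows, y, max_iter=100, tol_exponent=4, reg_param=0.0):
    """Fit Poisson regression (log link) with IRLS; returns (coefficients, iterations)."""
    tol = 10.0 ** -tol_exponent          # e.g. 4 -> 1E-4
    p = len(x_rows[0])
    mu = [yi + 0.1 for yi in y]          # assumed starting values for the mean
    eta = [math.log(m) for m in mu]      # linear predictor under the log link
    beta = None
    for iteration in range(1, max_iter + 1):
        # Working response; for Poisson with a log link the weights equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(m * r[i] * r[j] for m, r in zip(mu, x_rows))
                 + (reg_param if i == j else 0.0)
                 for j in range(p)] for i in range(p)]
        xtwz = [sum(m * r[i] * zi for m, r, zi in zip(mu, x_rows, z))
                for i in range(p)]
        new_beta = solve(xtwx, xtwz)
        converged = beta is not None and max(
            abs(n - b) for n, b in zip(new_beta, beta)) < tol
        beta = new_beta
        eta = [sum(b * v for b, v in zip(beta, r)) for r in x_rows]
        mu = [math.exp(e) for e in eta]
        if converged:
            break
    return beta, iteration


# Hypothetical data: an intercept column plus one predictor, with count outcomes.
x_rows = [[1.0, float(x)] for x in range(5)]
y = [2, 3, 6, 7, 12]
beta, iterations = fit_poisson_irls(x_rows, y, max_iter=100, tol_exponent=4)
```

In this sketch a larger tolerance exponent demands a smaller change between iterations before stopping, while `max_iter` caps the work when that threshold is never reached. A positive `reg_param` shrinks the coefficients toward zero.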