Modeling Operators

Modeling Algorithm (Model) operators define the modeling method, or mathematical calculations, to apply to an input dataset.

Team Studio algorithms, or modeling approaches, use historical data from the input dataset to produce a predictive model; this process is known as model training. Each Modeling operator is associated with a Predictor/Classifier operator. Together, they can be applied to other datasets within an analytic workflow to predict future results for those datasets.
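
The pairing of a modeling step with a prediction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Team Studio's implementation: the names LinearModel, train_linear_model, and predict are hypothetical, and a one-variable least-squares fit stands in for whichever algorithm a Modeling operator applies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinearModel:
    """Hypothetical container for what a modeling step hands to a predictor."""
    slope: float
    intercept: float


def train_linear_model(rows: List[Tuple[float, float]]) -> LinearModel:
    """Modeling step: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


def predict(model: LinearModel, xs: List[float]) -> List[float]:
    """Prediction step: score a different dataset with the trained model."""
    return [model.slope * x + model.intercept for x in xs]


# Historical (x, y) pairs used for training; here y = 2x + 1 exactly.
historical = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
model = train_linear_model(historical)

# A separate dataset with no known outcome, scored by the predictor.
predictions = predict(model, [5.0, 6.0])
print(predictions)
```

The point of the split is that the trained model is a standalone artifact: it is produced once from historical data and can then be applied to any number of other datasets.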

Alpine Forest - MADlib
Uses the MADlib built-in function, forest_train(), to generate multiple decision trees, the combination of which is used to make a prediction based on several independent columns.
Alpine Forest Classification
An Alpine Forest Classification model is an ensemble classification method that creates a collection of decision trees with controlled variation. Ensemble modeling is the application of many models, each operating on a subset of the data.
Alpine Forest Predictor - MADlib
Uses the model trained by Alpine Forest (MADlib) and scores the results. It must be connected to an Alpine Forest (MADlib) operator.
Alpine Forest Regression
Applies an ensemble algorithm to make a numerical prediction by aggregating (typically averaging) the numerical predictions of the regression trees in the ensemble.
ARIMA Time Series (DB)
Applies the ARIMA algorithm to an input time series data set and generates step forecasts for simulation or predictive modeling needs.
ARIMA Time Series (HD)
Applies the ARIMA algorithm to an input time series data set and generates step forecasts for simulation or predictive modeling needs.
Association Rules
Association Rules modeling is the process of determining patterns that occur frequently within a data set, such as frequent combinations or sets of items that appear together, subsequences, or substructures within the data.
Collaborative Filter Trainer
Collaborative filtering is commonly used for recommender systems. Given input data for users, products, and ratings, the Collaborative Filter Trainer uses an alternating least squares (ALS) method, in which users and products are described by a small set of latent factors that can be used to predict unknown or empty entries in the sparse user-product ratings matrix.
Decision Tree
Applies a classification modeling algorithm to a set of input data. The Decision Tree operator has three configuration phases: tree growth, pre-pruning, and pruning.
Decision Tree - MADlib
Team Studio supports the MADlib Decision Tree model implementation.
Decision Tree Classification - CART
Uses the MADlib built-in function tree_train() to generate a decision tree that predicts the value of a categorical column based on several independent columns.
Decision Tree Regression - CART
Generates a decision tree that predicts the value of a numeric column based on several independent columns.
Elastic Net Linear - MADlib
Team Studio supports the MADlib open-source implementation of the Elastic Net Linear Regression algorithm, which applies elastic net regularization to linear regression problems.
Elastic Net Logistic - MADlib
Team Studio supports the MADlib implementation of the Elastic Net Logistic Regression algorithm.
Generalized Linear Regression Models
Fits a regression model to predict a dependent variable that follows some distribution from the exponential family of distributions.
Gradient Boosting Classification
A predictive method by which a series of shallow decision trees incrementally reduce prediction errors of previous trees. This method can be used for both classification and regression.
Gradient Boosting Regression
A predictive method by which a series of shallow decision trees incrementally reduce prediction errors of previous trees. This method can be used for both regression and classification.
K-Means (DB)
Clusters the members of a data set into groups. The input is a data set that contains the attribute values of the data members to use as clustering or partitioning criteria.
K-Means (HD)
Clusters the members of a data set into groups. The input is a data set that contains the attribute values of the data members to use as clustering or partitioning criteria.
K-Means Clustering - MADlib
Team Studio supports the MADlib K-Means Clustering model implementation.
Linear Regression (HD)
Use the Linear Regression operator to fit a trend line to an observed data set in which one data value (the dependent variable) depends linearly on the values of the other, causal variables (the independent variables).
Linear Regression (DB)
Use the Linear Regression operator to fit a trend line to an observed data set in which one data value (the dependent variable) depends linearly on the values of the other, causal variables (the independent variables).
Linear Regression - MADlib
Team Studio supports the MADlib open-source implementation of the Linear Regression algorithm.
Logistic Regression (DB)
The Logistic Regression operator fits an s-curve logistic or logit function to a data set to calculate the probability of the occurrence of a specific categorical event based on the values of a set of independent variables.
Logistic Regression (HD)
The Logistic Regression operator fits an s-curve logistic or logit function to a data set to calculate the probability of the occurrence of a specific categorical event based on the values of a set of independent variables.
Logistic Regression - MADlib
The binomial Logistic Regression (MADlib) operator models the relationship between a dichotomous dependent variable and one or more predictor variables.
Naive Bayes (DB)
The Naive Bayes operator calculates the probability of a particular event occurring. It can be used to predict the probability of a certain data point being in a particular classification.
Naive Bayes (HD)
The Naive Bayes operator calculates the probability of a particular event occurring. It can be used to predict the probability of a certain data point being in a particular classification.
Neural Network
Implements the Spark MLlib MultiLayer Perceptron Classifier (MLPC), a feedforward neural network that consists of multiple layers of nodes in a directed graph, each layer fully connected to the next one in the network.
PCA (DB)
Uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables (principal components).
PCA (HD)
Uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables (principal components).
SVM Classification
Classifies data (both linearly and non-linearly separable) by finding the boundary that separates the classes with the widest possible margin.
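
As an illustration of one idea from the list above, the Gradient Boosting entries describe a series of shallow trees that incrementally reduce the prediction errors of previous trees. The sketch below shows that loop for regression with squared error. It is a minimal teaching example, not the operator's implementation: the one-split "stumps", the function names, the toy data, and the learning rate are all assumptions chosen for brevity.

```python
from typing import List, Tuple

# A stump is the shallowest possible tree: (threshold, left_value, right_value).
Stump = Tuple[float, float, float]


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Pick the single split on x that minimizes squared error on the residuals."""
    best: Stump = (xs[0], 0.0, 0.0)
    best_err = float("inf")
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_val = sum(left) / len(left)
        right_val = sum(right) / len(right) if right else 0.0
        err = sum((r - left_val) ** 2 for r in left)
        err += sum((r - right_val) ** 2 for r in right)
        if err < best_err:
            best_err, best = err, (threshold, left_val, right_val)
    return best


def stump_predict(stump: Stump, x: float) -> float:
    threshold, left_val, right_val = stump
    return left_val if x <= threshold else right_val


def mse(ys: List[float], preds: List[float]) -> float:
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs: List[float], ys: List[float], n_trees: int = 20,
          learning_rate: float = 0.5) -> Tuple[float, List[Stump]]:
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps


def boosted_predict(base: float, stumps: List[Stump], x: float,
                    learning_rate: float = 0.5) -> float:
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Toy data (an assumption for illustration): a noisy step from about 1 to about 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

base, stumps = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, [boosted_predict(base, stumps, x) for x in xs])
print(f"MSE with mean only: {initial_mse:.4f}; after boosting: {final_mse:.4f}")
```

Each round fits a new stump to what the ensemble so far still gets wrong, and the learning rate scales down each correction so that no single tree dominates.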

Related concepts

Sampling Operators

Transformation Operators

Exploration Operators

Data Operators

NLP Operators

Prediction Operators

Model Validation Operators

Tool Operators
