Naive Bayes

The Naive Bayes operator calculates the probability of a particular event occurring. It is used to predict the probability of a certain data point being in a particular classification.


Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Model
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The Naive Bayes classifier estimates the probability that a data point belongs to each class. It combines Bayes' theorem with a strong assumption of independence among the predictors. Bayes' theorem gives the probability of an event given that a related event has already occurred. A Naive Bayes classifier treats each predictor's influence on the outcome as independent of the other predictors, regardless of whether that holds in the actual data.

  • The TIBCO Data Science – Team Studio Naive Bayes Operator computes the dependent variable's class priors and each of the independent variable's probability distributions using the Naive Bayes conditional probability theorem with the independence assumption.
  • As an overview, the Naive Bayes conditional probability theorem says that, given a data set (X) and an outcome hypothesis (H), the posterior probability that the hypothesis is true is proportional to the likelihood multiplied by the prior probability.
  • Depending on the precise nature of the probability model, the Naive Bayes classifiers can be trained very efficiently in a supervised learning setting.
  • For simplicity, the "prior probability" is often abbreviated as the "prior" and the "posterior probability" as the "posterior".
  • The likelihood brings in the effect of the data, while the prior specifies the belief in the hypothesis before the data was observed.

More formally, Bayes' formula for conditional probability is represented as,

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

where,

  • P(H|X) is the conditional probability of outcome H happening given condition X,
  • P(X|H) is the conditional probability of the outcome X happening given condition H,
  • P(H) is the prior observed probability of the outcome H happening,
  • P(X) is the prior observed probability of the outcome X happening.

This Bayes formula is helpful because it provides a way to calculate the posterior probability, P(H|X), from P(H), P(X|H), and P(X), all of which can be calculated from historical data.
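As a minimal sketch of Bayes' formula, the following Python snippet computes a posterior from a prior, a likelihood, and the evidence. The function name and all numeric values are hypothetical and chosen only for illustration; they do not come from the operator or from any data set.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' formula: P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical probabilities (assumptions for illustration only).
p_h = 0.3              # prior P(H)
p_x_given_h = 0.8      # likelihood P(X|H)
p_x_given_not_h = 0.2  # likelihood P(X|not H)

# P(X) from the law of total probability over H and not H.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(f"P(H|X) = {p_h_given_x:.4f}")
```

In practice, each of the three inputs would be estimated from counts in historical data rather than set by hand.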

The Naive Bayes conditional independence assumption formula is as follows:

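$$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

where C is the class and x_1, ..., x_n are the values of the n predictors.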
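The independence assumption means the score for each class is the class prior multiplied by one likelihood term per predictor. The sketch below illustrates this for categorical predictors. The function name, the nested-dictionary layout, the class and feature names, and all probabilities are hypothetical; this is not the operator's internal implementation.

```python
from math import prod


def naive_posterior(priors, likelihoods, observation):
    """Return normalized class probabilities under the independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: {value: P(value | class)}}}
    observation: {feature: value}
    """
    scores = {
        c: prior * prod(likelihoods[c][f][v] for f, v in observation.items())
        for c, prior in priors.items()
    }
    total = sum(scores.values())
    # Dividing by the total plays the role of P(X) in Bayes' formula.
    return {c: s / total for c, s in scores.items()}


# Hypothetical probability tables (assumptions for illustration only).
priors = {"a": 0.6, "b": 0.4}
likelihoods = {
    "a": {"f1": {"low": 0.2, "high": 0.8}, "f2": {"on": 0.3, "off": 0.7}},
    "b": {"f1": {"low": 0.6, "high": 0.4}, "f2": {"on": 0.5, "off": 0.5}},
}
result = naive_posterior(priors, likelihoods, {"f1": "low", "f2": "off"})
print(result)
```

Real implementations typically sum log-probabilities instead of multiplying raw probabilities, which avoids numeric underflow when there are many predictors.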

If the feature is a continuous value, the conditional distribution over the class variable C is expressed as follows:

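$$P(x_i \mid C) = \frac{1}{\sqrt{2\pi\sigma_{i,C}^{2}}} \exp\left(-\frac{(x_i - \mu_{i,C})^{2}}{2\sigma_{i,C}^{2}}\right)$$

where μ_{i,C} and σ²_{i,C} are the mean and variance of predictor i within class C.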

  • This formula describes the ideal normal distribution curve for each independent variable's value.

    Note:

    This is a simplifying assumption, since most of the independent variables are unlikely to have exactly normal distributions.

  • However, the Naive Bayes model predictions are still quite accurate with an acceptable level of confidence.
  • The Naive Bayes Operator can accept a dependent column that has two or more discrete categories.

    Note: If the dependent variable is a numeric integer, each integer is treated as a separate category.
  • The independence assumption treats all the predictors or variables as independently related to the outcome.
  • For each continuous variable, the Naive Bayes results give a normal probability curve for each possible category of the dependent variable.
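To make the normal-distribution assumption for continuous predictors concrete, the following sketch fits a mean and variance to the values of one predictor within one class and evaluates the resulting density. The function names and sample values are hypothetical, and the use of the population variance is an assumption; the operator's underlying Spark implementation may estimate the variance differently.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian(values):
    """Estimate (mean, variance) for one predictor within one class."""
    return fmean(values), pvariance(values)


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    coefficient = 1.0 / math.sqrt(2 * math.pi * variance)
    return coefficient * math.exp(-((x - mean) ** 2) / (2 * variance))


# Hypothetical values of one continuous predictor for one class.
values = [1.0, 2.0, 3.0]
mean, variance = fit_gaussian(values)

density_at_mean = gaussian_pdf(2.0, mean, variance)
density_off_mean = gaussian_pdf(3.0, mean, variance)
print(mean, variance, density_at_mean, density_off_mean)
```

The density peaks at the class mean, so an observation near the mean contributes a larger likelihood term for that class than one far from it.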

This operator implements the Naive Bayes algorithm from Spark MLlib.

Input

An input is a single tabular data set.

Bad or Missing Values

  • Null values are not allowed and result in an error.
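Because null values cause an error, it can help to check the input before it reaches the operator. The following is a minimal sketch that assumes the rows are available as Python dictionaries; the function name and sample rows are hypothetical, and in a workflow this check would more likely be done with an upstream operator or query.

```python
def rows_with_nulls(rows):
    """Return the indices of rows that contain at least one None value."""
    return [
        index
        for index, row in enumerate(rows)
        if any(value is None for value in row.values())
    ]


# Hypothetical rows (assumptions for illustration only).
rows = [
    {"x1": 1.0, "x2": "red", "label": "a"},
    {"x1": None, "x2": "blue", "label": "b"},
    {"x1": 3.0, "x2": "red", "label": None},
]
bad_rows = rows_with_nulls(rows)
print(bad_rows)
```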

Configuration

If the Use all available columns as Predictors parameter is set to Yes, the operator uses all available columns as predictors. Otherwise, it uses the columns specified in the Continuous Predictors and Categorical Predictors parameters. You can also specify the event model type and the lambda (smoothing) parameter. The following table includes the configuration details for the Naive Bayes operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Dependent Variable Specify the categorical data column as a dependent column.
Use all available columns as Predictors When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, the user must select at least one of the Continuous or Categorical Predictors.
Continuous Predictors

Specify the numerical data columns as independent columns. Each selected column must be numerical. Click Select Columns to select the required columns.

Note:

The columns selected in the Categorical Predictors parameter are not available.

Categorical Predictors

Specify the categorical data columns as independent columns.

Note:

The columns selected in the Continuous Predictors parameter are not available.

Model Type

The event model type used by Naive Bayes. The following values are available:

  • Multinomial

  • Complement

  • Bernoulli

  • Gaussian

Default: Multinomial

Note:
  • The feature values for the Multinomial and Complement models must be non-negative (greater than or equal to 0) values.

  • The feature value for the Bernoulli model must be either 0 or 1.

For more information, see the Apache Spark documentation.

Lambda

Specify the additive smoothing parameter. The value must be non-negative (greater than or equal to 0).

Default: 1.0
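The effect of the Lambda parameter can be illustrated with generic additive (Laplace) smoothing, where lambda is added to every category count before normalizing. This is a sketch of the general technique, not necessarily the exact formula the operator applies for each model type; the function name and sample data are hypothetical.

```python
from collections import Counter


def smoothed_probabilities(observed, categories, lam=1.0):
    """Additive smoothing: (count + lam) / (n + lam * number_of_categories)."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    counts = Counter(observed)
    denominator = len(observed) + lam * len(categories)
    if denominator == 0:
        raise ValueError("no observations and lambda is 0")
    return {c: (counts[c] + lam) / denominator for c in categories}


# Hypothetical observations; category "c" never appears in the data.
observed = ["a", "a", "b"]
categories = ["a", "b", "c"]

with_smoothing = smoothed_probabilities(observed, categories, lam=1.0)
without_smoothing = smoothed_probabilities(observed, categories, lam=0.0)
print(with_smoothing)
print(without_smoothing)
```

With lambda set to 0, an unseen category gets probability 0, which zeroes out the whole product of likelihoods for that class. A positive lambda keeps every probability above 0.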

Output

Visual Output
  • Parameter Summary Info: Displays information about the input parameters and their current settings.
  • Training Summary: Displays a table containing data for the dependent variable and for each of the categorical and continuous predictors. The dependent variable data represents the prior probability of each label.

    For the Bernoulli, Complement, and Multinomial model types, the predictor data shows the conditional probability distribution of each predictor. For the Gaussian model type, the data represents the exponential (exp) of the mean value for each predictor.

Output to successive operators
A model object that can be used with a Predictor operator. The Predictor operator produces three additional columns:
  • PRED_NB: The predictive value of the classification model.
  • CONF_NB: The probability of the predicted value.
  • INFO_NB: Overall probabilities for each class.
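The relationship among the three columns can be sketched as follows: the predicted value is the class with the highest probability, the confidence is that probability, and the info column holds the probabilities of all classes. The function name, the dictionary representation, and the sample probabilities are hypothetical; the actual format of the INFO_NB column in the Predictor output may differ.

```python
def prediction_columns(class_probabilities):
    """Derive the three output values from per-class probabilities."""
    predicted = max(class_probabilities, key=class_probabilities.get)
    return {
        "PRED_NB": predicted,                       # most probable class
        "CONF_NB": class_probabilities[predicted],  # its probability
        "INFO_NB": dict(class_probabilities),       # all class probabilities
    }


# Hypothetical class probabilities for a single row.
row_output = prediction_columns({"yes": 0.7, "no": 0.3})
print(row_output)
```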

Example

The following example demonstrates the Naive Bayes operator.

Workflow of Naive Bayes Operator

Data

golf: This data set contains the following information:

  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.

Parameter Setting

The parameter settings for the golf data set are as follows:

  • Dependent Variable: play

  • Use all available columns as Predictors: No

  • Continuous Predictors: temperature,humidity

  • Categorical Predictors: outlook,wind

  • Model Type: multinomial

  • Lambda: 1.0

Results

The following figures display the results for the parameter settings for the golf data set.

Parameter Summary Info

Parameter Summary Info

Training Summary

Training Summary result