Naive Bayes
The Naive Bayes operator calculates the probability of a particular event occurring. It is used to predict the probability of a certain data point being in a particular classification.
Information at a Glance
Parameter | Description |
---|---|
Category | Model |
Data source type | TIBCO® Data Virtualization |
Send output to other operators | Yes |
Data processing tool | TIBCO® DV, Apache Spark 3.2 or later |
Algorithm
The Naive Bayes classifier calculates the probability of an event occurring. It combines Bayes' theorem with an assumption of strong independence among the predictors. Bayes' theorem calculates the probability of an event given that a prior event has occurred. A Naive Bayes classifier treats the influence of each predictor on the outcome as independent of the others, regardless of whether the predictors are actually independent.
- The TIBCO Data Science – Team Studio Naive Bayes Operator computes the dependent variable's class priors and each independent variable's probability distribution using the Naive Bayes conditional probability theorem with the independence assumption.
- As an overview, the Naive Bayes conditional probability theorem says that, given a data set (X) and an outcome hypothesis (H), the posterior probability that the hypothesis is true is proportional to the product of the likelihood and the prior probability.
- Depending on the precise nature of the probability model, Naive Bayes classifiers can be trained very efficiently in a supervised learning setting.
- For simplicity, the "prior probability" is often abbreviated as the "prior" and the "posterior probability" as the "posterior".
- The likelihood brings in the effect of the data, while the prior specifies the belief in the hypothesis before the data was observed.
More formally, Bayes' formula for conditional probability is represented as

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

where
- P(H|X) is the conditional probability of outcome H happening given condition X,
- P(X|H) is the conditional probability of outcome X happening given condition H,
- P(H) is the prior observed probability of outcome H happening,
- P(X) is the prior observed probability of outcome X happening.
This Bayes formula is helpful because it provides a way to calculate the posterior probability P(H|X) from P(H), P(X|H), and P(X), which can be calculated from historic data.
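As a rough illustration of the formula, the following Python sketch computes a posterior from a prior, a likelihood, and the evidence. The function name `posterior` and all numeric values are hypothetical and chosen only for illustration; they do not come from the operator or from any real data set.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' formula: P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical estimates, as might be derived from historic data.
p_h = 0.3              # P(H): prior probability of the outcome
p_x_given_h = 0.8      # P(X|H): likelihood of the data when H is true
p_x_given_not_h = 0.2  # P(X|not H): likelihood of the data when H is false

# P(X) via the law of total probability over H and not H.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 4))
```

In this sketch, observing X raises the probability of H above its prior because X is more likely under H than under its alternative.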
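The Naive Bayes conditional independence assumption formula is as follows:

$$P(X \mid H) = \prod_{i=1}^{n} P(x_i \mid H)$$

where x_1, ..., x_n are the values of the individual predictors in X.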
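If the feature is a continuous value, the conditional distribution over the class variable C is expressed as follows:

$$P(x_i \mid C) = \frac{1}{\sqrt{2\pi\sigma_{i,C}^{2}}} \exp\left(-\frac{(x_i - \mu_{i,C})^{2}}{2\sigma_{i,C}^{2}}\right)$$

where x_i is the value of the feature, and \mu_{i,C} and \sigma_{i,C}^{2} are the mean and variance of that feature's values within class C.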
This formula describes the ideal normal distribution curve for each independent variable's value.
Note: This is a simplifying assumption, since most of the independent variables are unlikely to have exactly normal distributions. However, the Naive Bayes model predictions are still quite accurate, with an acceptable level of confidence.
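The independence assumption and the normal distribution for continuous features can be sketched together in Python. This is a minimal illustration, not the operator's implementation: the helper names (`gaussian_pdf`, `class_scores`, `normalize`), the feature names `x1` and `x2`, the class labels, and every numeric value are assumptions made up for the example.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Normal density of x for a feature with the given mean and variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * variance))


def class_scores(row, priors, params):
    """Unnormalized posterior per class: prior times the product of
    per-feature densities (the independence assumption)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, value in row.items():
            mean, variance = params[label][feature]
            score *= gaussian_pdf(value, mean, variance)
        scores[label] = score
    return scores


def normalize(scores):
    """Rescale the scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical class priors and per-class (mean, variance) for each feature.
priors = {"a": 0.6, "b": 0.4}
params = {
    "a": {"x1": (10.0, 4.0), "x2": (50.0, 25.0)},
    "b": {"x1": (14.0, 9.0), "x2": (60.0, 16.0)},
}

posteriors = normalize(class_scores({"x1": 11.0, "x2": 52.0}, priors, params))
print(posteriors)
```

Production implementations typically sum log-probabilities instead of multiplying densities, to avoid numeric underflow when there are many predictors.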
The Naive Bayes Operator can accept a dependent column that has two or more discrete categories.
Note: If the dependent variable is a numeric integer, each integer is treated as a separate category.
- The independence assumption treats all the predictors or variables as independently related to the outcome.
- The Naive Bayes theorem results give the normal probability curve of each possible categorical value occurring for that variable.
This operator implements the Naive Bayes algorithm from Spark MLlib.
Input
The input is a single tabular data set.
Bad or Missing Values
- Null values are not allowed and result in an error.
Configuration
If the Use all available columns as Predictors parameter is set to Yes, the operator uses all available columns as predictors; otherwise, it uses the specified Continuous and Categorical predictors. You can also specify the event model type and the lambda (smoothing) parameter. The following table includes the configuration details for the Naive Bayes operator.
Parameter | Description |
---|---|
Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
Dependent Variable | Specify the categorical data column as a dependent column. |
Use all available columns as Predictors | When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, the user must select at least one of the Continuous or Categorical Predictors. |
Continuous Predictors | Specify the numerical data columns as independent columns. Only numerical columns can be selected. Click Select Columns to select the required columns. Note: The columns selected in the Categorical Predictors parameter are not available. |
Categorical Predictors | Specify the categorical data columns as independent columns. Note: The columns selected in the Continuous Predictors parameter are not available. |
Model Type | Specify the event model type supported by Naive Bayes. The following values are available: Bernoulli, Complement, Gaussian, and Multinomial. Default: Multinomial. Note: For more information, see the Apache Spark documentation. |
Lambda | Specify the additive smoothing parameter. The value must be non-negative (greater than or equal to 0). Default: 1.0 |
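
To show what an additive smoothing parameter such as Lambda does, the following Python sketch applies textbook additive (Laplace) smoothing to category counts. It is an illustration under stated assumptions: the function `smoothed_probabilities` and the sample values are hypothetical, and the exact computation that Spark MLlib performs for each model type may differ in detail.

```python
from collections import Counter


def smoothed_probabilities(values, categories, lam=1.0):
    """Estimate P(category) as (count + lam) / (n + lam * k),
    where n is the number of observations and k the number of categories."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    counts = Counter(values)
    total = len(values) + lam * len(categories)
    if total == 0:
        raise ValueError("no observations and no smoothing")
    return {c: (counts[c] + lam) / total for c in categories}


# Hypothetical observations of one categorical predictor within one class.
observed = ["sunny", "sunny", "rain"]
categories = ["sunny", "overcast", "rain"]

probs = smoothed_probabilities(observed, categories, lam=1.0)
unsmoothed = smoothed_probabilities(observed, categories, lam=0.0)
print(probs)
print(unsmoothed)
```

With a smoothing value of 0, a category that never appears in the training data gets a probability of exactly 0, which zeroes out the whole product for that class. A positive value avoids this.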
Output
- Parameter Summary Info: Displays information about the input parameters and their current settings.
- Training Summary: Displays a table containing data for the dependent variable and for each of the categorical and continuous predictors. The dependent variable data represents the prior probability of each label.
For the Bernoulli, Complement, and Multinomial model types, the predictor data shows the conditional probability distribution of each predictor. For the Gaussian model type, the data represents the exponential (exp) of the mean value for each predictor.
- PRED_NB: The predicted value (class) from the classification model.
- CONF_NB: The probability of the predicted value.
- INFO_NB: Overall probabilities for each class.
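The relationship between the three prediction outputs can be sketched as follows. This assumes that PRED_NB is the class with the highest probability and that CONF_NB is that class's probability, which matches the descriptions above but is not confirmed as the operator's exact behavior or output format. The function `summarize_prediction` and the probabilities are hypothetical.

```python
def summarize_prediction(class_probabilities):
    """Return (predicted label, its probability, all class probabilities)."""
    predicted = max(class_probabilities, key=class_probabilities.get)
    confidence = class_probabilities[predicted]
    return predicted, confidence, dict(class_probabilities)


# Hypothetical per-class probabilities for a single row.
pred_nb, conf_nb, info_nb = summarize_prediction({"yes": 0.7, "no": 0.3})
print(pred_nb, conf_nb, info_nb)
```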
Example
The following example demonstrates the Naive Bayes operator.
golf: This data set contains the following information:
- Five columns: outlook, temperature, wind, humidity, and play.
- 14 rows.
Parameter Setting
The parameter settings for the golf data set are as follows:
- Dependent Variable: play
- Use all available columns as Predictors: No
- Continuous Predictors: temperature,humidity
- Categorical Predictors: outlook,wind
- Model Type: multinomial
- Lambda: 1.0
The following figures display the results of the parameter settings for the golf data set.
Parameter Summary Info
Training Summary