Naive Bayes with Deployment (Classification)

Full-featured implementation of the Naive Bayes classifier for classification problems. The final solution is automatically stored for deployment.

General

Element Name Description
Detail of computed results reported Specifies the level of detail of the reported results. If Minimal detail is requested, spreadsheets with the analysis summary, model specifications, and descriptive statistics (classification statistics) are displayed. At the Comprehensive level of detail, a spreadsheet of predictions and their accuracy, as well as their histogram plots, is displayed. In addition to the above, the All results level displays a spreadsheet (if the 'Creates residual statistics' option is selected) containing all data set variables and their statistics, including predictions and accuracy (whichever are applicable).
Missing data deletion Specifies how missing data are handled. Casewise excludes every case that has missing data for any of the variables selected for the analysis. Mean substitution replaces missing values with the mean of the respective variable. The third option (Use Missing) treats each case on a variable level (Note: This option is not applicable for categorical dependent and predictor variables).
Generate datasource, if N for input less than Generates a data source for further analyses with other Data Miner nodes if the input data source has fewer than N observations, where N is the value entered in this edit field. Note that N is compared against the total number of observations in the input data source, not the number of valid or selected observations.
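The difference between the Casewise and Mean substitution options can be illustrated with a minimal Python sketch. It is not Statistica code: the function names, the use of None as the missing-value marker, and the sample data are assumptions made for illustration only.

```python
from statistics import mean

MISSING = None  # assumption: None marks a missing value in this sketch


def casewise_delete(rows):
    """Drop every case (row) that has a missing value in any variable."""
    return [row for row in rows if all(v is not MISSING for v in row)]


def mean_substitute(rows):
    """Replace each missing value with the mean of its variable (column)."""
    n_vars = len(rows[0])
    col_means = []
    for j in range(n_vars):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is MISSING else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical data: four cases, two continuous variables.
data = [
    [1.0, 10.0],
    [MISSING, 20.0],
    [3.0, MISSING],
    [5.0, 30.0],
]

kept = casewise_delete(data)     # only the two complete cases remain
filled = mean_substitute(data)   # all four cases remain, gaps are filled
```

Casewise deletion shrinks the sample, while mean substitution keeps every case at the cost of reducing the variance of the filled variables.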

Sampling

Element Name Description
Divide data into train and test samples Divides the data set into training and test samples. The training subset is used to fit the model, while the test subset serves as an independent check of its performance.
Sampling method Specifies the method used to divide the data set into training and test subsets. Random sampling assigns cases to the training and test samples at random. In contrast, the First N method selects the first N cases as the training sample and uses the rest as the test sample. NOTE: You can also use a learning/testing indicator variable to sample from the data. This functionality is available on the Advanced tab of the data spreadsheet in the Data Acquisition of the Statistica Data Miner environment. Selecting the learning/testing indicator method overrides any sampling choice you make on this tab.
Size of training sample (%) Specifies the percentage of data cases that will be used to form the training sample. The remaining valid cases in the data set will be used as the test sample.
Seed Specifies the seed of the random number generator used to randomly divide the data into training and test subsets.
Use first N cases Selects the first N valid cases in the data set as the training subset. The rest are used for testing.
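The two sampling methods described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm Statistica uses: the function names are hypothetical, the rounding of the training-sample size is a guess, and the input is assumed to contain only valid cases.

```python
import random


def random_split(cases, train_percent, seed):
    """Randomly assign train_percent % of the cases to the training sample.

    A fixed seed makes the split reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(len(cases)))
    rng.shuffle(indices)
    # Assumption: the training-sample size is rounded to the nearest integer.
    n_train = round(len(cases) * train_percent / 100)
    train_idx = set(indices[:n_train])
    train = [c for i, c in enumerate(cases) if i in train_idx]
    test = [c for i, c in enumerate(cases) if i not in train_idx]
    return train, test


def first_n_split(cases, n):
    """Use the first n cases for training and the rest for testing."""
    return cases[:n], cases[n:]


cases = list(range(10))  # hypothetical case identifiers

train_a, test_a = random_split(cases, train_percent=70, seed=42)
train_b, test_b = random_split(cases, train_percent=70, seed=42)  # same seed
train_f, test_f = first_n_split(cases, 4)
```

Running the random split twice with the same seed yields identical subsets, which is the purpose of the Seed option. The First N split depends only on the order of the cases, so it is appropriate only when that order carries no systematic pattern.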

Distributions

Element Name Description
Threshold Places a minimum bound on class conditional probabilities defined by the categorical inputs.
Input distributions Specifies the distributions of the continuous input (independent) variables. For example, if you have four continuous independent variables and want to assign a normal distribution to the first, a gamma distribution to the second, and lognormal and Poisson distributions to the third and fourth, respectively, enter the string '1 | 3 | 2 | 4' (without the quotes). More generally, the string has the form 'index of normal inputs | index of lognormal inputs | index of gamma inputs | index of Poisson inputs'. The default string '*' assigns a normal distribution to all continuous independent variables. Note that categorical inputs are always assigned discrete distributions, so their indices must not be included in this edit field.
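The two options in this section can be sketched in Python. The first function reads a distribution string in the slot order given above (normal, lognormal, gamma, Poisson); the second applies a threshold as a simple floor on a class conditional probability. Both are hypothetical helpers, not part of Statistica. In particular, the text does not say how several indices within one slot are separated, so spaces or commas are assumed here, and it does not say whether probabilities are renormalized after the threshold is applied, so no renormalization is done.

```python
import re

# Slot order taken from the description of the edit field.
DISTRIBUTIONS = ("normal", "lognormal", "gamma", "poisson")


def parse_input_distributions(spec, n_continuous):
    """Map 1-based indices of continuous inputs to distribution names."""
    if spec.strip() == "*":
        # Default: every continuous input is treated as normal.
        return {i: "normal" for i in range(1, n_continuous + 1)}
    slots = spec.split("|")
    if len(slots) != len(DISTRIBUTIONS):
        raise ValueError("expected four '|'-separated slots")
    assignment = {}
    for name, slot in zip(DISTRIBUTIONS, slots):
        # Assumption: multiple indices in a slot are separated by
        # spaces or commas; an empty slot assigns nothing.
        for token in re.findall(r"\d+", slot):
            index = int(token)
            if not 1 <= index <= n_continuous:
                raise ValueError(f"index {index} is out of range")
            if index in assignment:
                raise ValueError(f"index {index} is assigned twice")
            assignment[index] = name
    return assignment


def apply_threshold(probability, threshold):
    """Floor a class conditional probability at the given minimum."""
    return max(probability, threshold)


example = parse_input_distributions("1 | 3 | 2 | 4", n_continuous=4)
default = parse_input_distributions("*", n_continuous=3)
floored = apply_threshold(0.0, 0.0001)  # category never seen with this class
```

With the example string, input 1 is normal, input 3 is lognormal, input 2 is gamma, and input 4 is Poisson. The floor keeps a category that never occurred with a given class in the training data from forcing that class's posterior probability to zero.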

Memory usage

Element Name Description
Restrict memory usage Restrict the amount of memory that can be used by the analysis.
Amount of memory that can be used by the analysis Specifies the maximum amount of memory that the analysis is allowed to use.

Results

Element Name Description
Prior type Prior types for the class labels of the dependent variables.
Custom prior If you select Custom as the prior type, you need to specify the prior probability for each of the class labels.
Subset used to generate results Select the subset for which the results should be displayed.
Include inputs Includes the independent variables in spreadsheets and histograms.
Include outputs Includes the dependent variables in spreadsheets and histograms.
Include predictions Includes predictions in spreadsheets and histograms.
Include residuals Includes residuals in spreadsheets and histograms.
Include confidence levels Includes confidence levels in spreadsheets and histograms.
Creates residual statistics Creates predicted and residual statistics for each case, depending on the selected level of detail.
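To show why the Prior type and Custom prior settings matter, the following Python sketch computes Naive Bayes posterior probabilities as the prior times the product of the class conditional probabilities, normalized over the classes. It is a generic textbook calculation, not Statistica's implementation; the function name, class labels, and probability values are made up for illustration, and all probabilities are assumed to be strictly positive.

```python
import math


def posterior(priors, likelihoods):
    """Return P(class | inputs) for each class.

    priors:      {class_label: P(class)}
    likelihoods: {class_label: [P(x_i | class) for each input i]}
    """
    # Work in log space to avoid underflow when there are many inputs.
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
        for c in priors
    }
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


# Hypothetical class conditional probabilities for one case with two inputs.
likelihoods = {"yes": [0.2, 0.5], "no": [0.4, 0.5]}

equal = posterior({"yes": 0.5, "no": 0.5}, likelihoods)
custom = posterior({"yes": 0.8, "no": 0.2}, likelihoods)

predicted_equal = max(equal, key=equal.get)
predicted_custom = max(custom, key=custom.get)
```

In this example the same case is classified as "no" under equal priors and as "yes" under the custom priors, so the choice of prior can change the predictions reported in the results spreadsheets.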

Deployment

Deployment is available if the Statistica installation is licensed for this feature.

Element Name Description
Generates C/C++ code Generates C/C++ code for deployment of the predictive model.
Generates SVB code Generates Statistica Visual Basic code for deployment of the predictive model.
Generates PMML code Generates PMML (Predictive Model Markup Language) code for deployment of the predictive model. This code can be used via the Rapid Deployment options to efficiently compute predictions for (score) large data sets.