Workspace Node: SANN Regression/Classification/Time Series/Clustering - Specifications - Sampling (CNN and ANS) Tab

In the SANN Regression, SANN Classification, SANN Time Series (Regression), SANN Time Series (Classification), or SANN Clustering node dialog box, under the Specifications heading, select the Sampling (CNN and ANS) tab to access the following options. This tab is not available if the Subsampling (random, bootstrap) option button is selected on the Specifications - Quick tab.

The performance of a neural network is measured by how well it generalizes to unseen data (i.e., how well it predicts data that was not used during training). Generalization is one of the major concerns when training neural networks. When the training data have been overfit (i.e., fit so completely that even the random noise within the particular data set is reproduced), it is difficult for the network to make accurate predictions on new data (i.e., when the network is deployed). See overfitting for more details.

One way to combat this problem is to split the data into two (or three) subsets: a training sample, a testing sample, and a validation sample. These samples can then be used to 1) train the network, 2) verify (test) the performance of the network during training, and 3) perform a final validation test to determine how well the network predicts "new" data that was used neither to train the model nor to test its performance during training.
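The three roles described above can be illustrated with a minimal sketch. It is not SANN's training algorithm: a plain linear model fitted by gradient descent stands in for the network, and the function name `fit_with_holdout`, the synthetic data, and the 70/15/15 split are all assumptions made for illustration. The training sample drives the weight updates, the test sample is checked after every epoch to decide which weights to keep, and the validation sample is used only once at the end.

```python
import numpy as np

def fit_with_holdout(X_train, y_train, X_test, y_test, epochs=200, lr=0.05):
    """Fit a linear model by gradient descent on the training sample and
    keep the weights that gave the lowest error on the test sample."""
    w = np.zeros(X_train.shape[1])
    best_w, best_err = w.copy(), np.inf
    for _ in range(epochs):
        # 1) Train: the gradient step uses the training sample only.
        grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
        w = w - lr * grad
        # 2) Test: monitor the error on cases not used for the update.
        test_err = np.mean((X_test @ w - y_test) ** 2)
        if test_err < best_err:
            best_w, best_err = w.copy(), test_err
    return best_w, best_err

# Synthetic data (hypothetical): 300 cases, 3 inputs, small random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Illustrative 70/15/15 split; the rows are already in random order.
X_tr, y_tr = X[:210], y[:210]
X_te, y_te = X[210:255], y[210:255]
X_va, y_va = X[255:], y[255:]

w, test_mse = fit_with_holdout(X_tr, y_tr, X_te, y_te)

# 3) Validate: a single final check on cases never seen during fitting.
val_mse = np.mean((X_va @ w - y_va) ** 2)
print(f"test MSE = {test_mse:.4f}, validation MSE = {val_mse:.4f}")
```

Because the test sample influences which weights are kept, its error is an optimistic estimate; the validation error is the one that reflects performance on genuinely new data.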

In SANN, the assignment of the cases to the subsets can be executed randomly or based upon a special subset variable in the data set.
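Random assignment of this kind can be sketched in a few lines. This is only an illustration of the idea, not Statistica's implementation: the function name `assign_random_samples`, the label strings, and the use of NumPy's random generator are assumptions. The percentages and the seed play the same roles as the Train (%), Test (%), Validation (%), and Seed for sampling options on this tab.

```python
import numpy as np

def assign_random_samples(n_cases, train_pct, test_pct, validation_pct, seed):
    """Return one label per case: 'train', 'test', 'validation', or 'unused'."""
    if not 0 < train_pct <= 100:
        raise ValueError("train_pct must be larger than 0 and at most 100")
    if test_pct < 0 or validation_pct < 0:
        raise ValueError("test_pct and validation_pct must not be negative")
    if train_pct + test_pct + validation_pct > 100:
        raise ValueError("percentages must sum to no more than 100")

    # The seed fixes the shuffle, so the same seed yields the same samples.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cases)

    # Round down so the three counts can never exceed n_cases.
    n_train = int(n_cases * train_pct / 100)
    n_test = int(n_cases * test_pct / 100)
    n_val = int(n_cases * validation_pct / 100)

    # Cases not covered by any percentage stay 'unused' (omitted).
    labels = np.full(n_cases, "unused", dtype="U10")
    labels[order[:n_train]] = "train"
    labels[order[n_train:n_train + n_test]] = "test"
    labels[order[n_train + n_test:n_train + n_test + n_val]] = "validation"
    return labels

# Percentages summing to 90: 10% of the cases are randomly omitted.
labels = assign_random_samples(1000, 60, 20, 10, seed=42)
same = assign_random_samples(1000, 60, 20, 10, seed=42)
other = assign_random_samples(1000, 60, 20, 10, seed=7)
```

Here `labels` and `same` are identical because they share a seed, while `other` is a different random assignment with the same sample sizes.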

Sampling Method.

Random sampling.

Element Name Description
Random sample sizes Select this option button to have Statistica randomly assign cases to subsets based on the specified percentages, which must sum to no more than 100. If you do not want to split the data into subsets, set Test (%) and Validation (%) to zero. Note, however, that using at least one holdout sample (the test sample) is strongly recommended to aid in training the neural network models. The sample percentages can also sum to less than 100, which is useful when the data set contains a large number of cases. Training neural networks on large data sets can be time consuming, and randomly omitting cases from a large data set can reduce the computation time while still producing good models (models that perform well on real data), provided you include a reasonable percentage of the cases in the analysis.
Train (%) Specify the percent of valid cases to use in the training sample. Must be larger than 0 and smaller than or equal to 100.
Test (%) Use this option to randomly assign cases to the test sample. Specify here the percentage of cases to use. To select no test sample, simply enter 0 (not recommended).
Validation (%) Use this option to randomly assign cases to the validation sample. Specify here the percentage of cases to use. To select no validation sample, simply enter 0 (not recommended).
Seed for sampling The positive integer value entered in this box is used as the seed for a random number generator that produces the random samples from the data. Starting from the same seed will yield the same sample. If you want to create a different data sample, change the seed value.
Subset variable
Sampling variable Select this option button when your data set contains variables that indicate to which sample (Training, Testing, or Validation) each case belongs. You will then need to specify a spreadsheet variable and codes to identify which cases are used for the various samples.
Training sample Click this button to display the Sampling variable dialog box, which is used to select a sample identifier variable and a code for that variable that uniquely identifies the cases to be used in the training sample. After you have specified the sample identifier and code, the variable name and code will be displayed adjacent to the button.
Testing sample Click this button to display the Sampling variable dialog box, which is used to select a sample identifier variable and a code for that variable that uniquely identifies the cases to be used in the testing sample. After you have specified the sample identifier and code, the variable name and code will be displayed adjacent to the button.
Validation sample Click this button to display the Sampling variable dialog box, which is used to select a sample identifier variable and a code for that variable that uniquely identifies the cases to be used in the validation sample. After you have specified the sample identifier and code, the variable name and code will be displayed adjacent to the button.
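The Sampling variable approach amounts to a lookup from codes to samples. The sketch below is a simplified illustration, not Statistica's implementation: the function name `assign_by_subset_variable` and the codes "TR", "TE", and "VA" are hypothetical, a single identifier variable is assumed for all three samples, and leaving cases with unmatched codes out of the analysis is also an assumption.

```python
def assign_by_subset_variable(codes, train_code, test_code=None, validation_code=None):
    """Map each case's code in the identifier variable to a sample label."""
    mapping = {train_code: "train"}
    expected = 1
    if test_code is not None:
        mapping[test_code] = "test"
        expected += 1
    if validation_code is not None:
        mapping[validation_code] = "validation"
        expected += 1
    # Each code must identify exactly one sample.
    if len(mapping) != expected:
        raise ValueError("each sample needs its own distinct code")
    # Assumption: cases whose code matches no sample are left out.
    return [mapping.get(code, "unused") for code in codes]

# Hypothetical identifier variable with one code per case.
sample_id = ["TR", "TR", "TE", "VA", "TR", "XX"]
labels = assign_by_subset_variable(sample_id, "TR", test_code="TE", validation_code="VA")
print(labels)
# ['train', 'train', 'test', 'validation', 'train', 'unused']
```

Unlike random sampling, this assignment is fully determined by the data, so the same cases land in the same samples on every run without any seed.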

Options / C / W. See Common Options.

OK Click the OK button to accept all the specifications made in the dialog box and to close it. The analysis results will be placed in the Reporting Documents node after running (updating) the project.