SANN - Data Selection - Sampling (CNN and ANS) Tab
You can select the Sampling (CNN and ANS) tab of the SANN - Data selection dialog to access the options described here. The performance of a neural network is measured by how well it generalizes to unseen data (that is, how well it predicts data that was not used during training). Generalization is one of the major concerns when training neural networks. When the training data have been overfit (that is, fit so completely that even the random noise within the particular data set is reproduced), it is difficult for the network to make accurate predictions on new data once it is deployed.
One way to combat this problem is to split the data into two or three subsets: a training sample, a testing sample, and (optionally) a validation sample. These samples can then be used to 1) train the network, 2) verify (test) the performance of the network while it is being trained, and 3) perform a final validation test to determine how well the network predicts new data that was used neither to train the model nor to test its performance during training.
In SANN, the assignment of the cases to the subsets can be executed randomly or based upon a special subset variable in the data set.
| Option | Description |
|---|---|
| Sampling Method | |
| Random sampling | Random sample sizes: Specifies that Statistica randomly assigns cases to subsets based on the specified percentages, with the total summing to no more than 100. If you do not want to split the data into subsets, set both Test (%) and Validation (%) to zero. Note, however, that using at least one hold-out sample (the test sample) is strongly recommended to aid in training the neural network models. Also note that the sample percentages can sum to less than 100, which can be useful when the data set contains a large number of cases. Training neural networks on large data sets can be time consuming, and randomly omitting cases from a large data set can reduce computation time while still producing good models (models that perform well on real data), provided you include a reasonable percentage of the cases in the analysis. |
| Subset variable | |
| Sampling variable | Select the Sampling variable option button when your data set contains variables that indicate to which sample (Training, Testing, Validation) each case belongs. You need to specify a spreadsheet variable and codes to identify which cases are used for the various samples. |
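
The two assignment methods in the table can be illustrated outside of Statistica with a short Python sketch. This is not how SANN implements sampling internally; the function names `random_split` and `split_by_variable`, the sample labels, and the codes `"T"`, `"S"`, and `"V"` are all hypothetical. The sketch also assumes that percentages translate into fixed case counts and that cases with unlisted codes are left out of the analysis, neither of which is stated in the text above.

```python
import random


def random_split(n_cases, train_pct, test_pct, validation_pct, seed=0):
    """Randomly assign case indices to samples by percentage.

    The percentages may sum to less than 100; the remaining cases
    are simply not assigned to any sample (assumed behavior).
    """
    pcts = (train_pct, test_pct, validation_pct)
    if min(pcts) < 0:
        raise ValueError("percentages must be non-negative")
    if sum(pcts) > 100:
        raise ValueError("percentages must sum to no more than 100")

    # Shuffle the case indices reproducibly, then cut consecutive slices.
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train, n_test, n_val = (int(n_cases * p / 100) for p in pcts)

    return {
        "train": sorted(indices[:n_train]),
        "test": sorted(indices[n_train:n_train + n_test]),
        "validation": sorted(
            indices[n_train + n_test:n_train + n_test + n_val]
        ),
    }


def split_by_variable(codes, code_to_sample):
    """Assign cases to samples using one code per case.

    `codes` plays the role of the sampling variable; `code_to_sample`
    maps each code to a sample name. Cases whose code is not listed
    are left out (an assumption made for this sketch).
    """
    samples = {name: [] for name in code_to_sample.values()}
    for case_index, code in enumerate(codes):
        name = code_to_sample.get(code)
        if name is not None:
            samples[name].append(case_index)
    return samples
```

For example, `random_split(1000, 20, 5, 5)` keeps only 300 of 1,000 cases, which mirrors the idea of omitting cases from a large data set to reduce training time, while `split_by_variable(["T", "T", "S", "V"], {"T": "train", "S": "test", "V": "validation"})` reproduces an assignment that is already recorded in the data.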