Workspace Node: Data Health Check Summary - Specifications - Sparse Data Tab
In the Data Health Check Summary node dialog box, under the Specifications heading, select the Sparse Data tab to access the following options.
Element Name | Description |
---|---|
Run sparse data check | Select this check box to perform the sparse data check as defined by the rest of the options on this tab. |
Remove missing data | The options in this group box specify how Statistica removes missing data. |
Maximum % of MD per variable | Enter the maximum percentage of missing data (MD) allowed per variable. If a variable's percentage of missing data exceeds this threshold, Statistica identifies the variable as sparse. |
Maximum % of MD per case | Enter the maximum percentage of missing data (MD) allowed per case. If a case's percentage of missing data exceeds this threshold, Statistica identifies the case as sparse. |
Repeat until all invalid cases and variables have been removed | Deleting cases from the data set can change the percentage of missing data in a variable, and vice versa. To ensure that all remaining cases and variables satisfy the missing data requirements when the operation completes, select this check box to have Statistica repeat the sparse data check until all invalid cases and variables have been removed. |
Highlight missing data in sparse spreadsheet | Cases identified as sparse are displayed in a results spreadsheet. Select this check box to highlight the cells that contain missing data. Note that highlighting can take a long time for large, sparse data sets: if the input data set has more than 20,000 cases, selecting this check box significantly increases the time needed to generate and load the sparse spreadsheets. |
Identify categorical variables that have levels with fewer than _ percent of cases / _ cases | Select this check box, and then select one of the option buttons to have Statistica identify those categorical variables that have levels containing either a small percentage of cases or a small number of cases. It can be helpful to identify levels with few cases in order to bin them with other categories prior to modeling. |
Identify the condition when casewise deletion of missing data will delete _ percent or more of cases | Many analyses require complete data for a case to be used; that is, missing data is not allowed in any of the selected variables (exceptions include C&RT, where surrogate splits can handle missing data). With a large number of variables, it can be difficult to determine the number of complete cases in the data set. Select this check box to have Statistica check whether the percentage of incomplete cases meets or exceeds the specified value. The missing data plot will not be generated if the number of selected variables is greater than 300. |
Options | See Common Options. |
OK | Click the OK button to accept all the specifications made in the dialog box and to close it. |
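
The checks described in the table can be illustrated outside Statistica with a minimal Python sketch. It is not Statistica's implementation: the function names (`pct_missing`, `sparse_filter`, `rare_levels`, `casewise_deletion_pct`), the use of `None` for missing data, the order in which variables and cases are evaluated, and the choice to compute level percentages over non-missing values are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumption: a missing cell is represented as None


def pct_missing(values):
    """Percentage of missing entries in a sequence (0.0 if it is empty)."""
    values = list(values)
    if not values:
        return 0.0
    return 100.0 * sum(v is MISSING for v in values) / len(values)


def sparse_filter(rows, max_pct_var, max_pct_case, repeat=True):
    """Drop variables, then cases, whose missing-data % exceeds a threshold.

    rows is a list of dicts (variable name -> value). With repeat=True the
    pass is repeated until nothing more is removed, because dropping a case
    can push a variable over its threshold and vice versa.
    Returns (kept case indices, kept variable names).
    """
    cases = list(range(len(rows)))
    variables = list(rows[0].keys()) if rows else []
    while True:
        # Assumed order: evaluate variables first, then cases.
        bad_vars = [v for v in variables
                    if pct_missing(rows[i][v] for i in cases) > max_pct_var]
        variables = [v for v in variables if v not in bad_vars]
        bad_cases = [i for i in cases
                     if pct_missing(rows[i][v] for v in variables) > max_pct_case]
        cases = [i for i in cases if i not in bad_cases]
        if not repeat or not (bad_vars or bad_cases):
            return cases, variables


def rare_levels(values, min_pct=None, min_count=None):
    """Levels of a categorical variable with fewer than min_count cases or
    fewer than min_pct percent of the non-missing cases (an assumption)."""
    counts = Counter(v for v in values if v is not MISSING)
    total = sum(counts.values())
    rare = []
    for level, n in counts.items():
        if min_count is not None and n < min_count:
            rare.append(level)
        elif min_pct is not None and 100.0 * n / total < min_pct:
            rare.append(level)
    return rare


def casewise_deletion_pct(rows, variables):
    """Percentage of cases that casewise deletion would remove, that is,
    cases with at least one missing value among the given variables."""
    if not rows:
        return 0.0
    incomplete = sum(any(r[v] is MISSING for v in variables) for r in rows)
    return 100.0 * incomplete / len(rows)


# Hypothetical data: case 0 is sparse, and variable "a" only becomes
# sparse (2 of 4 missing = 50%) after case 0 has been removed.
rows = [
    {"a": 1,    "b": None, "c": None, "d": None},
    {"a": None, "b": 1,    "c": 1,    "d": 1},
    {"a": None, "b": 2,    "c": 2,    "d": 2},
    {"a": 4,    "b": 3,    "c": 3,    "d": 3},
    {"a": 5,    "b": 4,    "c": 4,    "d": 4},
]

single_pass = sparse_filter(rows, max_pct_var=40, max_pct_case=50, repeat=False)
repeated = sparse_filter(rows, max_pct_var=40, max_pct_case=50, repeat=True)
print(single_pass)  # ([1, 2, 3, 4], ['a', 'b', 'c', 'd'])
print(repeated)     # ([1, 2, 3, 4], ['b', 'c', 'd'])
print(casewise_deletion_pct(rows, ["a", "b", "c", "d"]))  # 60.0
```

In this example a single pass removes only case 0, which leaves variable "a" with 50% missing data, above the 40% threshold. Repeating the check removes "a" as well, which is the situation the Repeat until all invalid cases and variables have been removed option addresses.
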
Copyright © 2021. Cloud Software Group, Inc. All Rights Reserved.