Workspace Node: Data Health Check Summary - Specifications - Outliers Tab
In the Data Health Check Summary node dialog box, under the Specifications heading, select the Outliers tab to access the following options.
Element Name | Description |
---|---|
Run outlier detection | Select this check box to perform outlier detection as defined by the other options on this tab. |
Continuous variables | The options in this group box pertain to how Statistica identifies outliers for continuous variables in the data set. |
Outside k standard deviations from mean | In this outlier detection test, every value that lies more than k standard deviations from the mean, either below or above it, is identified as an outlier. The outlier coefficient, k, must be between 1 and 10. Specify k as a multiple of the standard deviation (sigma); e.g., specify 3 to define the non-outlier range mean - 3*sigma <= X <= mean + 3*sigma. |
Tukey (box plot) | In the Tukey outlier detection test, Statistica determines outliers based on a user-specified outlier coefficient (also known as the Tukey hinge distance factor). A data value is considered an outlier if data point value > UBV + o.c.*(UBV - LBV) or data point value < LBV - o.c.*(UBV - LBV), where UBV is the upper bound value of the box in the plot (e.g., the mean + std. err. or the 75th percentile), LBV is the lower bound value of the box in the plot (e.g., the mean - std. err. or the 25th percentile), and o.c. is the user-specified outlier coefficient. Specify an outlier coefficient, e.g., 1.5, to define the non-outlier range as 25th percentile - 1.5*(IQR) <= X <= 75th percentile + 1.5*(IQR), where IQR is the interquartile range. See the following diagram for more details. |
Outlier coefficient | Outlier coefficient used in one of the above outlier tests for continuous variables, that is, either the number of standard deviations or the Tukey hinge distance factor. |
Categorical variables | The options in this group box pertain to how Statistica identifies outliers for categorical variables in the data set. |
Less than percentage | Cases with observed factor levels that contain less than the user-specified percentage of cases will be identified as outliers. |
Less than number of observations | Cases with observed factor levels that contain fewer than the user-specified number of cases will be identified as outliers. |
Outlier threshold | The threshold, expressed as either a percentage or a number of cases, used in the outlier test for categorical variables. |
Repeat until all outliers have been identified | Select this check box to iterate recoding until no further outliers are found. |
Highlight outliers in summary spreadsheet | Select this check box to highlight outliers in the summary spreadsheet. Note that this operation may take a long time for large data sets. |
Options | See Common Options. |
OK | Click the OK button to accept all the specifications made in the dialog box and to close it. |
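
The outlier rules described in the table can be illustrated with a short Python sketch. This is not Statistica's implementation. The function names (`sigma_outliers`, `tukey_outliers`, `rare_levels`, `repeat_until_clean`) are hypothetical. The sketch assumes the sample standard deviation, the 25th and 75th percentiles as the box bounds, and Python's "inclusive" quantile method, any of which may differ from what Statistica uses. The repeat helper also assumes that flagged values are dropped before the test is rerun.

```python
from collections import Counter
from statistics import mean, quantiles, stdev


def sigma_outliers(values, k=3.0):
    """Return values farther than k sample standard deviations from the mean."""
    if not 1 <= k <= 10:
        raise ValueError("k must be between 1 and 10")
    m, s = mean(values), stdev(values)
    return [x for x in values if x < m - k * s or x > m + k * s]


def tukey_outliers(values, coef=1.5):
    """Return values beyond coef * IQR outside the 25th/75th percentiles."""
    # Assumption: the "inclusive" quantile method; other definitions exist.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in values if x < q1 - coef * iqr or x > q3 + coef * iqr]


def rare_levels(labels, threshold, as_percentage=True):
    """Return factor levels whose share (or count) of cases is below threshold."""
    counts = Counter(labels)
    total = len(labels)
    rare = set()
    for level, n in counts.items():
        size = 100.0 * n / total if as_percentage else n
        if size < threshold:
            rare.add(level)
    return rare


def repeat_until_clean(values, detect, **kwargs):
    """Rerun a detector, dropping flagged values, until nothing new is flagged."""
    remaining, found = list(values), []
    while len(remaining) >= 2:  # both detectors need at least two points
        new = detect(remaining, **kwargs)
        if not new:
            break
        found.extend(new)
        remaining = [x for x in remaining if x not in new]
    return found


values = [10, 12, 11, 13, 12, 11, 10, 12, 95]
labels = ["a"] * 6 + ["b"] * 3 + ["c"]

print(sigma_outliers(values, k=2))                 # k-sigma rule
print(tukey_outliers(values, coef=1.5))            # box plot rule
print(rare_levels(labels, 20))                     # levels under 20% of cases
print(rare_levels(labels, 4, as_percentage=False)) # levels under 4 cases
print(repeat_until_clean(values, tukey_outliers, coef=1.5))
```

In this small sample, the value 95 inflates the standard deviation enough that the k-sigma rule flags it with k = 2 but not with k = 3, whereas the quartile-based Tukey rule is less sensitive to the extreme value itself.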