Summary Statistics (DB)

Provides useful summary information for the selected columns of the data set passed by the preceding operator.

Information at a Glance

Category: Explore
Data source type: DB
Sends output to other operators: No
Data processing tool: n/a
Note: The Summary Statistics (DB) operator is for database data only. For Hadoop data, use the Summary Statistics (HD) operator.

Input

A data set from the preceding operator.

Restrictions

The Summary Statistics operator does not work with generic JDBC data sets. See the Operator and Data Source Compatibility Matrix.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns Select the numeric columns for which the summary statistics should be displayed.
  • Click Select Columns to open the dialog box for selecting the available columns from the input dataset for analysis.
  • Select or clear the check box next to a column name to select or deselect that column.
  • Click All to select all the columns.
  • Click OK to commit the selection changes.
  • Click Cancel to cancel all the selection changes.
Group By Click Select Columns to open the dialog box for selecting the available columns from the input dataset to group results by.
Calculate the Number of Distinct Values (slower)

Determines whether to calculate the number of distinct values for selected columns.

Calculating distinct values can add significant processing time.

Default value: Yes

Number of Most Common Values to Display

Determines the maximum number of the most common values to output for each column.

This parameter is enabled only if Calculate the Number of Distinct Values (slower) is set to Yes.

Output Schema The schema for the output table or view.
Output Table The table path and name where the results are output. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Storage Parameters Advanced database settings for the operator output. Available only for TABLE output.

See Storage Parameters Dialog Box for more information.

Drop If Exists Specifies whether to overwrite an existing table.
  • Yes - If a table with the same name exists, it is dropped before the results are stored.
  • No - If a table with the same name exists, the results window shows an error message.
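
The two Drop If Exists settings can be pictured with a small sketch. The Python snippet below uses the standard-library sqlite3 module as a stand-in database. The helper name store_results and the two-column result table are hypothetical and exist only for illustration; the operator applies this setting internally and might implement it differently.

```python
import sqlite3


def store_results(conn, table, rows, drop_if_exists):
    """Store (name, value) rows in `table`, mimicking the Drop If Exists setting.

    Hypothetical helper for illustration only; `table` must be a trusted name.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone() is not None
    if exists:
        if not drop_if_exists:
            # Drop If Exists = No: report an error and leave the table untouched.
            raise RuntimeError(f"Table {table!r} already exists")
        # Drop If Exists = Yes: drop the old table before storing new results.
        conn.execute(f'DROP TABLE "{table}"')
    conn.execute(f'CREATE TABLE "{table}" (name TEXT, value REAL)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    conn.commit()
```

With drop_if_exists set to True, a second call replaces the earlier results. With it set to False, the second call raises an error and the existing table is left unchanged.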

Output

Visual Output
A table that displays the analysis results for the selected columns. By default, the table contains the following items.
  • Name
  • Data type
  • Count
  • Unique value count
  • Null value count
  • Empty value count
  • Zero value count
  • Minimum value
  • 25% (approx.) - Approximate 25th percentile for numerical columns.
  • Median (approx.) - Approximate median value for numerical columns.
  • 75% (approx.) - Approximate 75th percentile for numerical columns.
  • Maximum value
  • Standard deviation
  • Average
  • Positive value count
  • Negative value count
  • Most Common (Value) - The most common value for the column.
  • Most Common (Percentage) - The percentage of all values in the column that equal the most common value.
  • 2nd Most Common (Value) - The second most common value.
  • 2nd Most Common (Percentage) - The percentage of all values in the column that equal the second most common value.
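
To make the items above concrete, the following Python sketch computes comparable values for a single numeric column held in memory. It is an illustration under stated assumptions, not the operator's implementation: the function name summarize and its result keys are hypothetical, None stands in for a database NULL, Count is assumed to include NULL rows, percentages are assumed to be relative to the non-NULL values, the standard deviation is assumed to be the sample standard deviation, and the quartiles are computed exactly rather than approximated. The empty value count is omitted because the sketch handles numeric values only.

```python
from collections import Counter
import statistics


def summarize(values, top_n=2):
    """Summarize one numeric column; None stands in for a database NULL.

    Hypothetical helper; needs at least two non-NULL values.
    """
    present = [v for v in values if v is not None]
    # Exact quartiles here; the operator reports approximate ones.
    q1, median, q3 = statistics.quantiles(present, n=4)
    counts = Counter(present)
    return {
        "count": len(values),  # assumption: NULL rows are counted
        "unique_value_count": len(counts),
        "null_value_count": len(values) - len(present),
        "zero_value_count": sum(1 for v in present if v == 0),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "stddev": statistics.stdev(present),  # assumption: sample std. dev.
        "average": statistics.fmean(present),
        "positive_value_count": sum(1 for v in present if v > 0),
        "negative_value_count": sum(1 for v in present if v < 0),
        # (value, percentage of non-NULL values) for the top_n most common values
        "most_common": [
            (value, 100.0 * n / len(present))
            for value, n in counts.most_common(top_n)
        ],
    }
```

For example, summarize([1, 2, 2, 3, None, 0, -4]) reports one NULL, one zero, four positive values, and one negative value, with 2 as the most common value. The top_n argument plays the role of the Number of Most Common Values to Display parameter.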
Data Output
A data set of the analysis results (that is, the same data shown in the visual output).