Summary Statistics (HD)

Provides useful summary information for the selected columns of the data set passed by the preceding operator.

Information at a Glance

Category: Explore
Data source type: HD
Sends output to other operators: No
Data processing tool: Pig
Note: The Summary Statistics (HD) operator is for Hadoop data only. For database data, use the Summary Statistics (DB) operator.

Input

A data set from the preceding operator.

Configuration

Parameter Description

Notes
Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.

Columns
The numeric columns for which to display summary statistics. Click Select Columns to open a dialog box and choose from the columns available in the input data set.

Group By
The columns from the input data set by which to group the results.

Calculate the Number of Distinct Values (slower)
Determines whether to calculate the number of distinct values for the selected columns: Yes (the default) or No.
Note: Calculating distinct values can add significant processing time.

Number of Most Common Values to Display
The maximum number of most common values to output for each column.
Enabled only when Calculate the Number of Distinct Values is set to Yes.

Store Results?
Specifies whether to store the results.
  • true - the results are stored.
  • false - the data set is passed to the next operator without being stored.

Results Location
The HDFS directory where the results of the operator are stored. This is the main directory; the subdirectory within it is specified in Results Name. Click Choose File to open the Hadoop File Explorer Dialog Box and browse to the storage location. Do not edit the text directly.

Results Name
The name of the file in which to store the results.

Overwrite
Specifies whether to delete existing data at the path given by Results Location and Results Name.
  • Yes - if the path exists, the existing data is deleted and the results are saved.
  • No - the operator fails if the path already exists.

Advanced Spark Settings Automatic Optimization
  • Yes - use the default Spark optimization settings.
  • No - provide customized Spark optimization settings. Click Edit Settings to customize them. See Advanced Settings Dialog Box for more information.
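
The storage parameters (Store Results?, Results Location, Results Name, and Overwrite) can be read as a small decision procedure. The following Python sketch illustrates that reading on a local file system. It is not the operator's implementation: the function name `resolve_results_path` and its arguments are hypothetical, a local directory stands in for HDFS, and Results Name is assumed to be a single path component under Results Location.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_results_path(store_results, results_location, results_name, overwrite):
    """Return the path to write results to, or None when results are not stored.

    Hypothetical helper: mirrors the documented parameter semantics on a
    local file system rather than HDFS.
    """
    if not store_results:
        return None  # Store Results? = false: nothing is written
    target = Path(results_location) / results_name
    if target.exists():
        if not overwrite:
            # Overwrite = No: fail if the path already exists
            raise FileExistsError(f"Results path already exists: {target}")
        # Overwrite = Yes: delete the existing data before saving
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    return target


# Usage: a temporary directory stands in for the Results Location.
with tempfile.TemporaryDirectory() as location:
    existing = Path(location) / "summary_stats"
    existing.write_text("old results")

    # Overwrite = Yes removes the existing data and returns the target path.
    replaced = resolve_results_path(True, location, "summary_stats", overwrite=True)
    old_data_removed = not existing.exists()

    # Overwrite = No fails when the path is already present.
    existing.write_text("old results")
    try:
        resolve_results_path(True, location, "summary_stats", overwrite=False)
        failed_on_existing = False
    except FileExistsError:
        failed_on_existing = True

    # Store Results? = false skips storage entirely.
    skipped = resolve_results_path(False, location, "summary_stats", overwrite=True)
```

Checking for an existing path before any work is done keeps the failure in the Overwrite = No case early and explicit, which matches the "fail if the path already exists" behavior described above.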

Output

Visual Output
A table that displays the analysis results of the selected fields. The following list shows the default table contents.
  • Name
  • Data type
  • Count
  • Unique value count
  • Null value count
  • Empty value count
  • Zero value count
  • Minimum value
  • 25% (approx.) - Approximate 25th percentile (first quartile) for numerical columns.
  • Median (approx.) - Approximate median (50th percentile) for numerical columns.
  • 75% (approx.) - Approximate 75th percentile (third quartile) for numerical columns.
  • Maximum value
  • Standard deviation
  • Average
  • Positive value count
  • Negative value count
  • Most Common (Value) - The most common value for the column.
  • Most Common (Percentage) - The percentage of the column's values that equal the most common value.
  • 2nd Most Common (Value) - The second most common value.
  • 2nd Most Common (Percentage) - The percentage of the column's values that equal the second most common value.
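
To make the numeric fields in this list concrete, here is a minimal Python sketch that computes most of them for a single column. It is an illustration only, not how the operator computes its results: the function `summarize_column` and its dictionary keys are hypothetical, quartiles are computed exactly rather than approximately, and three choices are assumptions, namely that Count includes null rows, that percentages are taken over all rows, and that the standard deviation is the sample standard deviation. Data type and empty value count are omitted.

```python
import statistics
from collections import Counter


def summarize_column(name, values, max_common=2):
    """Summarize one numeric column; None represents a null value."""
    present = [v for v in values if v is not None]
    # Exact quartiles; the operator itself reports approximate values.
    q1, median, q3 = statistics.quantiles(present, n=4)
    total = len(values)
    most_common = [
        (value, 100.0 * freq / total)  # assumption: percentage of all rows
        for value, freq in Counter(present).most_common(max_common)
    ]
    return {
        "name": name,
        "count": total,  # assumption: counts all rows, including nulls
        "unique_count": len(set(present)),
        "null_count": total - len(present),
        "zero_count": sum(1 for v in present if v == 0),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "std_dev": statistics.stdev(present),  # assumption: sample std dev
        "average": statistics.mean(present),
        "positive_count": sum(1 for v in present if v > 0),
        "negative_count": sum(1 for v in present if v < 0),
        "most_common": most_common,  # (value, percentage) pairs
    }


# Usage: one null, two zeros, one negative value, and 2 as the most common value.
summary = summarize_column("x", [0, 2, 2, -1, None, 5, 2, 0])
for field, value in summary.items():
    print(f"{field}: {value}")
```

The `max_common` argument plays the role of the Number of Most Common Values to Display parameter: it caps how many (value, percentage) pairs are returned for the column.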
Data Output
A data set of the analysis results (that is, the same data shown in the visual output).