Histogram

Analyzes the values of the selected fields of a data set, and generates a graphical representation of the frequency distribution of the numeric data.

Information at a Glance

Category: Explore
Data source type: DB, HD
Sends output to other operators: No
Data processing tool: Pig

Algorithm

Histogram analysis calculates data frequency for a specific column.



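For each column specified, users enter either the number of bins to generate or the width of each bin. A bin is an interval of the column's value range. The range between the minimum and maximum values is either divided equally into the specified number of bins or split into intervals of the specified width.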

For example, suppose a column's minimum value is 0 and its maximum value is 100. If the user specifies five bins, each bin covers 20 units. If the user specifies 10 bins, each bin covers 10 units.
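The arithmetic in this example can be sketched in Python as follows. This is an illustration only, not the operator's implementation: the function name bin_edges and its parameters are hypothetical, and clipping the last bin at the maximum when a width does not divide the range evenly is an assumption.

```python
import math


def bin_edges(minimum, maximum, num_bins=None, width=None):
    """Return bin boundaries computed from either a bin count or a bin width."""
    if (num_bins is None) == (width is None):
        raise ValueError("specify exactly one of num_bins or width")
    if num_bins is not None:
        # Equal division of the range between minimum and maximum.
        width = (maximum - minimum) / num_bins
    else:
        # Assumption: a partial last bin is clipped at the maximum.
        num_bins = math.ceil((maximum - minimum) / width)
    edges = [minimum + i * width for i in range(num_bins)]
    edges.append(maximum)
    return edges


five_bins = bin_edges(0, 100, num_bins=5)   # boundaries 20 units apart
ten_bins = bin_edges(0, 100, num_bins=10)   # boundaries 10 units apart
by_width = bin_edges(0, 100, width=25)      # four bins of width 25
```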

Bounds of each bin are defined as (Minimum, Maximum].

Note: When a user defines a minimum value, this value is not included in the first bin. The first bin begins at the lowest value above the defined minimum, and the last bin includes the defined maximum value.
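A minimal sketch of how a value could be assigned to a bin under the (Minimum, Maximum] bounds, assuming equal-width bins. The function bin_index is hypothetical and is not part of the product. Floating-point rounding near bin boundaries is ignored here.

```python
import math


def bin_index(value, minimum, maximum, num_bins):
    """Return the 0-based bin for value under (lower, upper] bounds, or None if out of range."""
    # The lower bound is exclusive and the upper bound is inclusive.
    if not (minimum < value <= maximum):
        return None
    width = (maximum - minimum) / num_bins
    # ceil places a value that sits exactly on a boundary into the lower bin.
    index = math.ceil((value - minimum) / width) - 1
    return min(index, num_bins - 1)


on_minimum = bin_index(0, 0, 100, 5)      # None: the minimum itself is excluded
on_boundary = bin_index(20, 0, 100, 5)    # first bin, (0, 20]
just_above = bin_index(20.5, 0, 100, 5)   # second bin, (20, 40]
on_maximum = bin_index(100, 0, 100, 5)    # last bin, (80, 100]
```

With these bounds, a value equal to a bin's upper edge is counted in that bin, not in the next one.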

Input

A data set from the preceding operator.

Configuration

Parameter: Description
Notes: Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Bins: Select Bin Configuration to choose the available columns from the input data set for analysis.

See Bin Configuration Dialog Box.

Output

Visual Output

Four sections are displayed: Counts, Cumulative Counts, Percentage, and Data.

Counts
Displays the histogram for one column at a time according to the defined groups (bins). Users can select a column from the name drop-down list.

Cumulative Counts
Displays a graph of the running total of rows as each successive bin is added.

Percentage
Displays a graph showing the percentage of the input column's rows that falls into each bin.

Data

Summarizes information about each histogram, with numerical measures for:

  • Bin name
  • Bin number
  • Bin start point
  • Bin end point
  • Count
  • Percentage
  • Cumulative counts
  • Cumulative %
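The numerical measures in this list can be derived from the per-bin counts alone. The following Python sketch shows one way to do so. The function summarize and the dictionary keys are hypothetical, and computing percentages against the total of the binned rows is an assumption about how the operator defines them.

```python
from itertools import accumulate


def summarize(counts):
    """Build one summary row per bin from a list of per-bin row counts."""
    total = sum(counts)
    if total == 0:
        return []
    running = accumulate(counts)  # running total across bins
    rows = []
    for number, (count, cumulative) in enumerate(zip(counts, running), start=1):
        rows.append({
            "bin_number": number,
            "count": count,
            "percentage": 100.0 * count / total,
            "cumulative_count": cumulative,
            "cumulative_percentage": 100.0 * cumulative / total,
        })
    return rows


rows = summarize([2, 3, 5])
```

The cumulative percentage of the last bin is always 100, which gives a quick check that every row was counted.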


Note: To learn more about the visualizations available in this operator, see Exploring Visual Results.

Data Output
None. This is a terminal operator.