Conceptual Overviews - Categorized Histograms
In general, histograms are used to examine frequency distributions of values of variables. For example, the frequency distribution plot shows which specific values or ranges of values of the examined variable are most frequent, how differentiated the values are, whether most observations are concentrated around the mean, whether the distribution is symmetrical or skewed, whether it is multimodal (i.e., has two or more peaks) or unimodal, etc. Histograms are also useful for evaluating the similarity of an observed distribution with theoretical or expected distributions.
The histogram procedure available from the Graphs menu allows you to produce histograms broken down by one or two categorical variables, or by any other one or two sets of logical categorization rules (via multiple subsets categorization).
There are two major reasons why frequency distributions are of interest.
- You can learn from the shape of the distribution about the nature of the examined variable (e.g., a bimodal distribution may suggest that the sample is not homogeneous and consists of observations that belong to two populations that are more or less normally distributed).
- Many statistics are based on assumptions about the distributions of analyzed variables; histograms help you to test whether those assumptions are met.
Often, the first step in the analysis of a new dataset is to run histograms on all variables. Using categorized histograms can make the results more informative and reveal, for example, a lack of homogeneity of the sample.
Histograms vs. Breakdown
Categorized histograms provide information similar to breakdowns (e.g., mean, median, minimum, maximum, differentiation of values, etc.; see Basic Statistics and Tables). Although specific (numerical) descriptive statistics are easier to read in a table, the overall shape and global descriptive characteristics of a distribution are much easier to examine in a graph. Moreover, the graph provides qualitative information about the distribution that cannot be fully represented by any single index. For example, an overall skewed distribution of income may indicate that the majority of people have an income that is much closer to the minimum than to the maximum of the income range. When broken down by gender and ethnic background, this characteristic of the income distribution may be found to be more pronounced in certain subgroups. Although this information is contained in the index of skewness (for each subgroup), it is usually more easily recognized and remembered when presented in the graphical form of a histogram. The histogram may also reveal "bumps" that may represent important facts about the specific social stratification of the investigated population, or anomalies in the distribution of income in a particular group caused by a recent tax reform.
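To make the "index of skewness for each subgroup" concrete, the following is a minimal sketch in Python. It assumes the simple moment coefficient of skewness (third central moment divided by the 1.5 power of the second), with no small-sample correction; the program's own skewness statistic may use an adjusted formula. The function names, the group labels "A" and "B", and the income values are all hypothetical and made up for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def skewness_by_group(records, group_key, value_key):
    """Compute the skewness of value_key separately for each level of group_key."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec[value_key])
    return {name: skewness(vals) for name, vals in groups.items()}


# Hypothetical incomes (in thousands), invented for this example only.
incomes = {"A": [20, 22, 24, 26, 28], "B": [20, 21, 22, 25, 90]}
records = [
    {"group": name, "income": value}
    for name, values in incomes.items()
    for value in values
]

result = skewness_by_group(records, "group", "income")
for name, value in result.items():
    print(name, round(value, 3))
```

A single number per subgroup summarizes the asymmetry, but it does not show where the mass of the distribution lies or whether there are local "bumps", which is the information the histogram adds.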
Categorization of Values within Each Histogram
All histogram procedures offer the standard selection of categorization methods; see Method of Categorization for more details.
Those categorization methods divide the entire range of values of the examined variable into a number of categories or sub-ranges for which frequencies are counted and presented in the plot as individual columns or bars.
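As an illustration of how a range of values is divided into sub-ranges and counted, here is a minimal Python sketch of equal-width categorization. This is only one common method, chosen as an assumption for the example; it is not necessarily how any of the program's categorization methods are implemented. The function name `histogram_counts` and the sample values are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width intervals and count values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Made-up sample values, for illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
edges, counts = histogram_counts(values, 3)

# Print a crude text histogram: one row per interval, one '#' per observation.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

Each count corresponds to the height of one column or bar in the plot; changing `n_bins` changes how coarse or fine the picture of the distribution is.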
Categorization of Values into Component Graphs
The categorization options for assigning observations to the component graphs of the categorized histogram are equally flexible. Component graphs may be created for the levels of a categorical variable (e.g., gender), continuous variables may be categorized into a user-defined number of intervals, or user-defined logical subsetting conditions may be specified to determine each sub-group.
The latter option is particularly powerful, because it allows you to base the categorization on "rules" that reference more than one variable, and on the logical relationships between those variables (e.g., a subgroup might consist of all individuals who are male, 30 or older, and divorced or never married).
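The idea of subsetting by logical rules can be sketched in Python as a set of named conditions, each evaluated against every case. This is an illustrative sketch only and does not reproduce the program's rule syntax; the variable names (`gender`, `age`, `marital`, `income`), the helper `split_by_rules`, and the data are hypothetical.

```python
def is_target(case):
    """Male, 30 or older, and divorced or never married."""
    return (
        case["gender"] == "male"
        and case["age"] >= 30
        and case["marital"] in ("divorced", "never married")
    )


# Each rule defines one component graph. Rules may reference several variables.
rules = {
    "male, 30+, divorced or never married": is_target,
    "all other cases": lambda case: not is_target(case),
}


def split_by_rules(cases, rules, variable):
    """Collect the values of `variable` for every case that satisfies each rule."""
    groups = {name: [] for name in rules}
    for case in cases:
        for name, rule in rules.items():
            if rule(case):
                groups[name].append(case[variable])
    return groups


# Made-up cases, for illustration only.
cases = [
    {"gender": "male", "age": 34, "marital": "divorced", "income": 41000},
    {"gender": "male", "age": 28, "marital": "never married", "income": 30000},
    {"gender": "female", "age": 45, "marital": "divorced", "income": 52000},
    {"gender": "male", "age": 52, "marital": "married", "income": 61000},
    {"gender": "male", "age": 40, "marital": "never married", "income": 38000},
]

groups = split_by_rules(cases, rules, "income")
for name, incomes in groups.items():
    print(name, incomes)
```

A histogram would then be drawn separately for the values collected under each rule. In this sketch the rules are not required to be mutually exclusive, so a case that satisfies more than one rule is counted in each matching subset.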
Categorized Histograms and Scatterplots
A useful application of the categorization methods for continuous variables is to represent the simultaneous relationships between three variables. Shown below is a scatterplot for two variables, Load 1 and Load 2.
Now suppose you want to add a third variable (Output) and examine how it is distributed at different levels of the joint distribution of Load 1 and Load 2. The following graph could be produced:
In this graph, Load 1 and Load 2 are both categorized into five intervals, and within each combination of intervals the distribution for variable Output is computed. Note that the "box" (parallelogram) encloses approximately the same observations (cases) in both graphs shown above.
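The two-way categorization described above can be sketched in Python as follows. The sketch assumes equal-width intervals for both Load 1 and Load 2 (the actual interval boundaries used in the graph are not stated here) and uses randomly generated stand-in data, since the original values are not available. All function and variable names are hypothetical.

```python
import random


def interval_index(value, lo, hi, n_intervals):
    """Index of the equal-width interval of [lo, hi] that contains value."""
    width = (hi - lo) / n_intervals
    if width == 0:
        return 0
    return min(int((value - lo) / width), n_intervals - 1)


def categorize_by_two(xs, ys, zs, n_intervals=5):
    """Group zs by the (x interval, y interval) cell that each case falls into."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = {}
    for x, y, z in zip(xs, ys, zs):
        key = (
            interval_index(x, x_lo, x_hi, n_intervals),
            interval_index(y, y_lo, y_hi, n_intervals),
        )
        cells.setdefault(key, []).append(z)
    return cells


# Synthetic stand-in data (NOT the data behind the graphs discussed in the text).
random.seed(0)
load1 = [random.gauss(0, 1) for _ in range(200)]
load2 = [0.5 * a + random.gauss(0, 1) for a in load1]
output = [a + b + random.gauss(0, 0.5) for a, b in zip(load1, load2)]

cells = categorize_by_two(load1, load2, output, n_intervals=5)
for key in sorted(cells):
    print(key, len(cells[key]))
```

Each key identifies one component graph in the 5 x 5 layout, and the list stored under it holds the Output values whose distribution would be drawn in that panel. Cells that contain no cases simply do not appear in the dictionary.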