Hierarchical Clustering

hclust

Description

Performs hierarchical clustering on a distance (dissimilarity) structure.

Usage

hclust(d, method = "complete", members = NULL) 
## S3 method for class 'hclust':
print(x, ...)

Arguments

d	a distance structure or a distance matrix. A distance measure is a measure such as Euclidean distance, which has a small value for similar observations. Normally, this is the result of the function dist, but it can be any data of the form returned by dist, or a full symmetric matrix. For details on the required structure, see the dist help file. Missing values (NAs) are not allowed.
method	a character string giving the clustering method. The methods currently implemented are "complete", "single", and "average". The string is partially matched.
members	NULL, or a vector of length n if there were n objects in the original data. The default (NULL) is equivalent to every element having the value 1.
x	an object of class "hclust". Normally, it is returned by the function hclust.
...	other optional arguments passed to the print function.
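
A minimal call might look like the following sketch. The data values and the variable names x, d, and hc are illustrative assumptions, not part of the original documentation.

```r
# Five one-dimensional observations (illustrative values only).
x <- matrix(c(0, 1, 5, 6, 20), ncol = 1)

# dist() returns the distance structure that hclust() expects as d.
d <- dist(x)

# Cluster with average linkage instead of the default "complete".
hc <- hclust(d, method = "average")

# Printing dispatches to the print method for class "hclust".
print(hc)
```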

Details

At each stage the two "nearest" clusters are combined to form one bigger cluster. (Initially each cluster contains a single point.) Several clustering methods are provided; they differ in how the distance between two clusters is defined.

method="single" specifies that the distance between two clusters is the minimum distance between a point in the first cluster and a point in the second cluster. method="single" typically creates long thin clusters.
method="average" specifies that the distance between clusters is the average of the distances between the points in one cluster and the points in the other cluster.
method="complete" uses the largest distance between a point in one cluster and a point in the other cluster. method="complete" usually forms more spherical clusters.
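
The three definitions above can be checked by hand on a tiny example. The sketch below assumes four one-dimensional points that form two obvious pairs, computes the between-cluster distance under each rule directly, and compares it with the height of the final merge reported by hclust. The point values and all variable names are illustrative.

```r
# Two small clusters of one-dimensional points (illustrative values).
pts <- c(0, 1, 4, 6)
a <- pts[1:2]
b <- pts[3:4]

# All pairwise distances between a point in a and a point in b.
pairwise <- abs(outer(a, b, "-"))

# Between-cluster distance under each linkage rule, computed by hand.
by_hand <- c(single   = min(pairwise),
             average  = mean(pairwise),
             complete = max(pairwise))

# The last merge joins a and b, so its height should match by_hand.
d <- dist(matrix(pts, ncol = 1))
final_height <- sapply(names(by_hand),
                       function(m) max(hclust(d, method = m)$height))

print(rbind(by_hand, final_height))
```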

For other clustering methods, see the details in the references.

In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which subtree should go on the right. Because, for n individuals, there are n-1 merges, there are 2^(n-1) possible orderings for the leaves in a cluster tree. The default algorithm in hclust is to order the subtrees so that the tighter cluster is on the left (the last merge of the left subtree is at a lower value than the last merge of the right subtree). Individuals are the tightest clusters possible, and merges involving two individuals place them in order by their observation number.
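
As a small illustration of this ordering rule, the sketch below uses four assumed observations in which the pair (3, 4) is tighter than the pair (1, 2). The values are chosen only so that no merge heights are tied.

```r
# Observations 1 and 2 are 2 apart; observations 3 and 4 are 1 apart.
x <- matrix(c(10, 12, 0, 1), ncol = 1)
hc <- hclust(dist(x), method = "complete")

# The pair (3, 4) merges at height 1 and the pair (1, 2) at height 2,
# so under the rule above the tighter pair is placed on the left,
# giving the leaf order 3 4 1 2.
print(hc$order)
```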

print.hclust is a hidden S3 method of the generic function print for class "hclust". It prints selected components of x on separate lines: the matched call, the clustering method, the distance method, and the number of objects.

Value

returns a "tree" representing the clustering, which is a list of class "hclust" consisting of the following components:

merge	an (n-1) by 2 matrix, if there were n objects in the original data. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then object -j was merged at this stage. If j is positive, then the merge was with the cluster formed at the (earlier) stage j of the algorithm.
height	a vector of the clustering "height"; that is, the distance between clusters merged at the successive stages.
order	a vector giving a permutation of the original objects suitable for plotting, in the sense that a cluster plot using this ordering does not have crossings of the branches.
labels	the "Labels" attribute of d.
method	a character string representing the clustering method.
call	the matched call.
dist.method	the distance method; that is, the "method" attribute of d.
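
The sign convention in merge is easiest to see on a very small tree. The following sketch assumes three illustrative observations and a hypothetical helper, describe, that turns one entry of merge into words.

```r
# Three one-dimensional observations (illustrative values only).
x <- matrix(c(0, 1, 5), ncol = 1)
hc <- hclust(dist(x), method = "complete")

# Hypothetical helper: negative entries are original objects,
# positive entries refer to the cluster formed at an earlier step.
describe <- function(j) {
  if (j < 0) paste("object", -j) else paste("cluster from step", j)
}

# One line per merge step, with the height at which it happened.
for (i in seq_len(nrow(hc$merge))) {
  cat("step", i, ":",
      describe(hc$merge[i, 1]), "+", describe(hc$merge[i, 2]),
      "at height", hc$height[i], "\n")
}
```

With n = 3 objects, merge has n-1 = 2 rows and height has 2 elements.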

Background

Cluster analysis divides datapoints into groups of points that are "close" to each other. The hclust function continues to aggregate groups together until there is just one big group. If it is necessary to choose the number of groups, this can be decided subsequently. Other methods (such as kmeans) require that the number of groups be decided from the start.
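
Choosing the number of groups after the tree has been built could look like the sketch below. It assumes that the cutree function from open-source R's stats package is available in your runtime; the data values are illustrative.

```r
# Five one-dimensional observations (illustrative values only).
x <- matrix(c(0, 1, 5, 6, 20), ncol = 1)
hc <- hclust(dist(x), method = "complete")

# The same tree can be cut into any number of groups afterwards.
groups2 <- cutree(hc, k = 2)   # group membership for 2 groups
groups3 <- cutree(hc, k = 3)   # group membership for 3 groups

print(groups2)
print(groups3)
```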

By changing the distance metric and the clustering method, several different cluster trees can be created from a single dataset. No one method seems to be useful in all situations. Single linkage ("single") can work poorly if two distinct groups have a few "stragglers" between them.

Differences between Spotfire Enterprise Runtime for R and Open-source R

Open-source R gives an error when the input d is a symmetric matrix, while Spotfire Enterprise Runtime for R allows it.
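
For code that must run in both environments, one option is to convert a full symmetric matrix to a distance structure before calling hclust. The sketch below assumes that as.dist is available in both runtimes; the matrix m is built here only for illustration.

```r
# Build a full symmetric distance matrix (illustrative values only).
m <- as.matrix(dist(matrix(c(0, 1, 5), ncol = 1)))

# Convert the full matrix to a distance structure before clustering,
# so that d has the form returned by dist.
hc <- hclust(as.dist(m))

print(hc$height)
```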

The methods "mcquitty", "ward", "median", and "centroid", which are available in open-source R, are not yet implemented in Spotfire Enterprise Runtime for R. If one of the unimplemented methods is specified, a warning is issued and the default method ("complete") is used.

References

Anderberg, M. R. 1973. Cluster Analysis for Applications. New York, NY: Academic Press.

Becker, R. A., Chambers, J. M., and Wilks, A. R. 1988. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books and Software.

Everitt, B. 1974. Cluster Analysis. London, UK: Heinemann Educational Books.

Everitt, B. 1980. Cluster Analysis. Second Edition. New York, NY: Halsted.

Gordon, A. D. 1999. Classification. London, UK: Chapman and Hall.

Hartigan, J. A. 1975. Clustering Algorithms. New York, NY: Wiley.

McQuitty, L. L. 1966. Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement 26, 825-831.

Murtagh, F. 1985. Multidimensional Clustering Algorithms. COMPSTAT Lectures 4. Wuerzburg: Physica-Verlag. (For details of the algorithms used.)

Sneath, P. H. A. and Sokal, R. R. 1973. Numerical Taxonomy. San Francisco, CA: Freeman.