hclust
Hierarchical Clustering
Description
Performs hierarchical clustering on a distance (dissimilarity) structure.
Usage
hclust(d, method = "complete", members = NULL)
## S3 method for class 'hclust':
print(x, ...)
Arguments
d |
a distance structure or a distance matrix.
A distance measure is a measure such as Euclidean distance,
which has a small value for similar observations.
Normally, this is the result of the function dist,
but it can be any data of the form returned by dist,
or a full symmetric matrix.
For details on the required structure, see the dist help file.
Missing values (NAs) are not allowed.
|
method |
a character string giving the clustering method.
The methods currently implemented are "complete", "single",
and "average".
The string can be abbreviated; partial matching is used.
|
members |
a vector of length n, where n is the number of objects in the original data.
The default (NULL) is equivalent to a vector in which every element is 1.
|
x |
an object of class "hclust".
Normally, it is returned by the function hclust.
|
... |
other optional arguments passed to the print function.
|
Details
At each stage the two "nearest" clusters are combined to form one bigger cluster.
(Initially each cluster contains a single point.)
Several clustering methods are provided:
- method="single" specifies that the distance between two clusters is the
minimum distance between a point in the first cluster and a point in the
second cluster. method="single" typically creates long thin clusters.
- method="average" specifies that the distance between clusters
is the average of the distances between the points in one cluster and
the points in the other cluster.
- method="complete" uses the largest distance between a point in
one cluster and a point in the other cluster. method="complete" usually
forms more spherical clusters.
For other clustering methods, see the details in the references.
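As a rough illustration of the three linkage rules above, the following sketch computes the inter-cluster distance by hand for two small one-dimensional clusters and compares it with the final merge height reported by hclust. The data and the names a, b, cross, and last_height are made up for this example.

```r
# Two illustrative clusters on a line (made-up data).
a <- c(0, 1)
b <- c(5, 7)

# All pairwise distances between a point in a and a point in b.
cross <- abs(outer(a, b, "-"))

single_d   <- min(cross)   # smallest cross-cluster distance
average_d  <- mean(cross)  # mean of all cross-cluster distances
complete_d <- max(cross)   # largest cross-cluster distance

# With these four points, a and b are each formed first, so the
# last merge height should equal the linkage distance computed above.
d <- dist(matrix(c(a, b), ncol = 1))
last_height <- function(method) hclust(d, method = method)$height[3]
heights <- c(single   = last_height("single"),
             average  = last_height("average"),
             complete = last_height("complete"))
```

Comparing heights with single_d, average_d, and complete_d shows how the choice of method changes the level at which the same two clusters are joined.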
In hierarchical cluster displays, a decision is needed
at each merge to specify which subtree should go on the left
and which subtree should go on the right.
Because, for n individuals, there are n-1 merges, there are
2^(n-1) possible orderings for the leaves in a cluster tree.
The default algorithm in hclust is to order the
subtrees so that the tighter cluster is on the left
(the last merge of the left subtree is at a lower value than
the last merge of the right subtree).
Individuals are the tightest clusters possible,
and merges involving two individuals place them in order
by their observation number.
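The counting argument above can be checked on a small case. The sketch below uses made-up data, counts the merges and the possible leaf orderings, and confirms that the order component lists every observation exactly once. The exact permutation returned depends on the ordering rule just described, so it is not asserted here.

```r
x <- matrix(c(0, 1, 5, 7, 20), ncol = 1)  # five made-up observations
hc <- hclust(dist(x), method = "complete")

n <- nrow(x)
n_merges <- nrow(hc$merge)   # one row per merge, n - 1 in total
n_orderings <- 2^n_merges    # each merge can be flipped left/right

# order should list every observation exactly once.
is_permutation <- identical(sort(as.integer(hc$order)), seq_len(n))
```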
print.hclust is a hidden S3 method of the generic function print for
class "hclust".
It prints a summary of x, one item per line:
the matched call, the clustering method, the distance method, and the number of objects.
Value
returns a "tree" representing the clustering,
which is a list of class
"hclust"
consisting of the following components:
merge |
an (n-1) by 2 matrix,
if there were n objects in the original data.
Row i of merge describes the
merging of clusters at step i of the clustering.
If an element j in the row is negative, then object -j
was merged at this stage.
If j is positive, then the merge was with the cluster formed at
the (earlier) stage j of the algorithm.
|
height |
a vector of the clustering "height";
that is, the distance between clusters merged at the successive stages.
|
order |
a vector giving a permutation of the original objects suitable for plotting,
in the sense that a cluster plot using this ordering
does not have crossings of the branches.
|
labels |
the "Labels" attribute of d.
|
method |
a character string representing the clustering method.
|
call |
the matched call.
|
dist.method |
the distance method; that is, the "method" attribute of d.
|
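To show how merge and height fit together, here is a small sketch that walks the rows of merge and describes each step in words. The data and the helpers describe_entry and describe_merge are hypothetical, introduced only for illustration.

```r
x <- matrix(c(0, 1, 5, 7, 20), ncol = 1)  # made-up data
hc <- hclust(dist(x), method = "single")

# Hypothetical helper: turn one entry of merge into a readable label.
# Negative entries are original objects; positive entries refer to
# the cluster formed at an earlier step.
describe_entry <- function(j) {
  if (j < 0) paste("object", -j) else paste("cluster from step", j)
}

# Hypothetical helper: one line of text per merge step.
describe_merge <- function(hc) {
  vapply(seq_len(nrow(hc$merge)), function(i) {
    row <- hc$merge[i, ]
    sprintf("step %d (height %g): %s + %s", i, hc$height[i],
            describe_entry(row[1]), describe_entry(row[2]))
  }, character(1))
}

steps <- describe_merge(hc)
print(steps)
```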
Background
Cluster analysis divides data points into groups of points that are "close"
to each other.
The hclust function continues to aggregate groups together
until there is just one big group.
If it is necessary to choose the number of groups,
this can be decided subsequently.
Other methods (such as kmeans) require that the number of groups be
decided from the start.
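Because the full tree is kept, the number of groups can be chosen afterwards. The sketch below assumes that cutree, from the stats package in open-source R, is available in your environment; the data are made up.

```r
x <- matrix(c(0, 1, 5, 7, 20), ncol = 1)  # made-up data
hc <- hclust(dist(x), method = "complete")

# Assumption: cutree() is available, as in open-source R.
groups3 <- cutree(hc, k = 3)  # cut the tree into three groups
groups2 <- cutree(hc, k = 2)  # cut the same tree into two groups
```

The same hc object serves both cuts, so trying several values of k does not require reclustering.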
By changing the distance metric and the clustering method,
several different cluster trees can be created from a single dataset.
No one method seems to be useful in all situations.
Single linkage ("single") can work poorly
if two distinct groups have a few "stragglers" between them.
Differences between TIBCO Enterprise Runtime for R and Open-source R
Open-source R gives an error when the input d is a symmetric matrix,
while TIBCO Enterprise Runtime for R allows it.
The methods "mcquitty", "ward", "median", and "centroid",
which are available in open-source R, are not yet implemented in TIBCO Enterprise Runtime for R.
If one of the unimplemented methods is specified,
a warning is issued and the default method ("complete") is used.
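If code has to run in both environments, one option is to sidestep both differences. The sketch below is a hypothetical wrapper (portable_hclust is not part of either product) that converts a full symmetric matrix with as.dist and rejects any method outside the three implemented ones, rather than relying on the warning and fallback.

```r
# Hypothetical helper, not part of TIBCO Enterprise Runtime for R or R.
portable_hclust <- function(d, method = "complete", ...) {
  implemented <- c("complete", "single", "average")
  idx <- pmatch(method, implemented)   # allow abbreviations such as "ave"
  if (is.na(idx)) {
    stop("method must be one of: ", paste(implemented, collapse = ", "))
  }
  if (is.matrix(d)) {
    stopifnot(isSymmetric(unname(d)))  # only full symmetric matrices
    d <- as.dist(d)                    # a form both environments accept
  }
  hclust(d, method = implemented[idx], ...)
}

# Made-up data: a full symmetric distance matrix.
m <- as.matrix(dist(matrix(c(0, 1, 5, 7, 20), ncol = 1)))
hc <- portable_hclust(m, method = "ave")
```

Failing early on an unsupported method is a design choice for this sketch; it avoids silently getting complete linkage when another method was intended.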
References
Anderberg, M. R. 1973. Cluster Analysis for Applications. New York, NY: Academic Press.
Becker, R. A., Chambers, J. M., and Wilks, A. R. 1988. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books and Software.
Everitt, B. 1974. Cluster Analysis. London: Heinemann Educ. Books.
Everitt, B. 1980. Cluster Analysis. Second Edition. New York, NY: Halsted.
Gordon, A. D. 1999. Classification. London, UK: Chapman and Hall.
Hartigan, J. A. 1975. Clustering Algorithms. New York, NY: Wiley.
McQuitty, L. L. 1966. Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement. Volume 26. 825-831.
Murtagh, F. 1985. Multidimensional clustering algorithms. COMPSTAT Lectures 4. Wuerzburg: Physica-Verlag. (For details of the algorithms used.)
Sneath, P. H. A. and Sokal, R. R. 1973. Numerical Taxonomy. San Francisco, CA: Freeman.
See Also
Examples
# Create a sample object using a built-in dataset
x <- Sdatasets::longley.x
hx <- hclust(dist(x))
hx
#
# Call:
# hclust(d = dist(x))
#
# Cluster method : complete
# Distance : euclidean
# Number of objects : 16
# Try a different distance measure and different clustering methods:
hclust(dist(x, "maximum"), method = "ave")
# members has one entry per object (16 here)
hclust(dist(x, "maximum"), method = "complete",
       members = c(rep(1, 10), 2, 1, 2, 1, 1, 1))
votes.clust <- hclust(dist(Sdatasets::votes.repub), "ave")
votes.clust
#
# Call:
# hclust(d = dist(Sdatasets::votes.repub), method = "ave")
#
# Cluster method : average
# Distance : euclidean
# Number of objects : 50