dist
Distance Matrix Calculation

Description

Creates a distance structure that represents all pairwise distances between objects in the data, and lists some related methods for class "dist".

Usage

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

# Generic function
as.dist(m, diag = FALSE, upper = FALSE)
## Default S3 method:
as.dist(m, diag = FALSE, upper = FALSE)

## S3 method for class 'dist':
as.matrix(x, ...)
## S3 method for class 'dist':
labels(object, ...)
## S3 method for class 'dist':
print(x, diag = NULL, upper = NULL, digits = getOption("digits"), justify = "none", right = TRUE, ...)

Arguments

x a matrix (typically a data matrix). The distances computed are among the rows of x. Missing values (NAs) are allowed.

method a character string that specifies the distance method to use. Specify one of the following:
  • "euclidean" (or the alternative spelling "euclidian") is the root sum-of-squares of differences.
  • "maximum" is the maximum difference.
  • "manhattan" is the sum of absolute differences.
  • "canberra" is similar to the Manhattan distance. The distinction is that, before summing, the absolute difference between the variables of the two objects is divided by the sum of the absolute variable values.
  • "binary" is the proportion of non-zeros that two vectors do not have in common (the number of positions where exactly one vector is non-zero, divided by the number of positions where at least one vector is non-zero).
  • "minkowski" can be considered as a generalization of both the Euclidean distance and the Manhattan distance. That is,
    (sum( (abs(x[i] - x[j]))^p) )^(1.0/p)
    where p is the power number specified in argument p.
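
As a rough illustration of the definitions above, the following sketch recomputes several of the methods by hand for two rows and compares each with the result of dist. The vectors a and b are hypothetical values chosen for this example. The "canberra" method is left out here because its definition differs between implementations when values are negative or zero.

```r
# Two illustrative rows (hypothetical values).
a <- c(1, 0, 3, 5)
b <- c(2, 0, 0, 1)
x <- rbind(a, b)

# Hand-computed versions of the definitions listed above.
euclid    <- sqrt(sum((a - b)^2))                   # root sum-of-squares
maxdiff   <- max(abs(a - b))                        # maximum difference
manhat    <- sum(abs(a - b))                        # sum of absolute differences
minkowski <- function(p) sum(abs(a - b)^p)^(1 / p)  # general power p

# Binary: a non-zero entry counts as "one". Count the columns where exactly
# one row is non-zero, out of the columns where at least one row is non-zero.
on_a <- a != 0
on_b <- b != 0
binary <- sum(xor(on_a, on_b)) / sum(on_a | on_b)

# Each comparison checks the hand computation against dist.
all.equal(as.numeric(dist(x, "euclidean")), euclid)
all.equal(as.numeric(dist(x, "maximum")), maxdiff)
all.equal(as.numeric(dist(x, "manhattan")), manhat)
all.equal(as.numeric(dist(x, "binary")), binary)
all.equal(as.numeric(dist(x, "minkowski", p = 3)), minkowski(3))

# Minkowski with p = 1 and p = 2 reduces to Manhattan and Euclidean.
all.equal(minkowski(1), manhat)
all.equal(minkowski(2), euclid)
```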
Missing values (NAs) in a row of x are not included in any distances involving that row. If the metric is "euclidean" and ng is the number of columns in which no missing values occur for the given rows, then the distance returned is sqrt(ncol(x) / ng) times the Euclidean distance between the two vectors of length ng shortened to exclude NAs.

The rule is similar for the "manhattan" metric, except that the coefficient is ncol(x)/ng. The "binary" metric excludes columns in which either row has an NA. If all values for a particular distance are excluded, the distance is NA.
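
The missing-value rule described above can be checked with a small sketch. The matrix below is a hypothetical example with one NA, and the names ok and ng are introduced only for this illustration.

```r
# Two rows, four columns; one value is missing in the first row.
x <- rbind(c(1, 2, NA, 4),
           c(2, 4,  6, 8))

# Columns in which neither row has a missing value.
ok <- !is.na(x[1, ]) & !is.na(x[2, ])
ng <- sum(ok)

# Distances computed on the shortened vectors only.
euclid_ok <- sqrt(sum((x[1, ok] - x[2, ok])^2))
manhat_ok <- sum(abs(x[1, ok] - x[2, ok]))

# Apply the scaling coefficients stated in the text.
expected_euclid <- sqrt(ncol(x) / ng) * euclid_ok
expected_manhat <- (ncol(x) / ng) * manhat_ok

# Compare with the values that dist returns.
all.equal(as.numeric(dist(x, "euclidean")), expected_euclid)
all.equal(as.numeric(dist(x, "manhattan")), expected_manhat)
```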

diag a logical value that specifies whether to print the diagonal of the distance matrix. If TRUE, the diagonal is printed.
upper a logical value that specifies whether to print the upper triangle of the distance matrix. If TRUE, the upper triangle is printed.
p a positive real number that specifies the power number to use in the "minkowski" method. The default is 2.
m an object of class "dist", an object that inherits from class "dist", or an object that can be coerced to a matrix through the as.matrix function. The m object is normally a numeric square matrix.
x, object an object of class "dist" or an object that inherits from class "dist".
digits a numeric value that specifies the number of significant digits that should be printed in numeric or complex data. By default, the value is configured through the "digits" argument in the options function.
justify a character string that specifies the justification of character strings relative to each other. You can choose "none", "left", "right", or "centre". For more information, see format.
right a logical value that controls alignment of character strings. If TRUE (the default), output is right aligned.
... other optional arguments to be passed to or from methods.
Value
dist and as.dist return an object of class "dist" that contains the distances among the rows of x. Because there are many distances and because the result of dist is typically an argument to hclust or cmdscale, a vector is returned rather than a symmetric matrix. For i less than j, the distance between row i and row j is the following element of the result:
    nrow(x) * (i - 1) - i * (i - 1) / 2 + (j - i)
The length of the vector that is returned is nrow(x) * (nrow(x) - 1) / 2, that is, it is of order nrow(x) squared.
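
The indexing rule can be sketched as a small helper. The function name dist_index and the random test matrix are assumptions made for this illustration and are not part of the package.

```r
# Hypothetical helper: position of the (i, j) distance, for i < j,
# within the vector returned by dist for n rows.
dist_index <- function(i, j, n) {
  stopifnot(i < j, j <= n)
  n * (i - 1) - i * (i - 1) / 2 + (j - i)
}

set.seed(1)
x <- matrix(rnorm(15), nrow = 5)   # 5 rows, 3 columns of sample data
d <- dist(x)
n <- nrow(x)
full <- as.matrix(d)               # square form, used here only for checking

# The element located by the formula matches the square-matrix entry.
i <- 2
j <- 4
all.equal(as.vector(d)[dist_index(i, j, n)], full[i, j])

# The vector holds one entry per unordered pair of rows.
length(d) == n * (n - 1) / 2
```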
Note that as.dist is a generic and only a non-visible default method is implemented.
The returned object has the following attributes:
"Size" the number of objects, that is, nrow(x).
"Labels" the row names of x or m (if they exist), otherwise column names of m (if they exist).
"Diag" the value of argument diag.
"Upper" the value of argument upper.
"method" the value of the distance method specified in method.
"p" the value of argument p. This attribute is present only if method is "minkowski".
"call" the call that created this "dist" object.
The following are non-visible methods of generic functions for class "dist":
as.matrix.dist returns a square distance matrix with row names and column names.
format.dist coerces the distance vector to character strings using a specified format.
labels.dist returns the "Labels" attribute of a dist object.
print.dist prints a dist object in distance matrix format.
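
As a brief sketch of how these methods fit together, the following converts a dist object to a square matrix and back. The small labeled matrix is a hypothetical example.

```r
# Three labeled rows of sample data.
x <- matrix(c(1, 2, 3, 4, 6, 8), nrow = 3,
            dimnames = list(c("a", "b", "c"), NULL))
d <- dist(x)

attr(d, "Size")    # number of objects, nrow(x)
labels(d)          # the "Labels" attribute, taken from the row names of x

# as.matrix expands the vector into a symmetric square matrix.
m <- as.matrix(d)
dim(m)
isSymmetric(m)

# as.dist reverses the conversion by keeping the lower triangle.
d2 <- as.dist(m)
all.equal(as.vector(d), as.vector(d2))
```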
Background
Distance measures are used in cluster analysis and in multidimensional scaling. The choice of metric may have a large impact on the results.
Differences between TIBCO Enterprise Runtime for R and Open-source R
In open-source R, the definition of the canberra method is to divide by the absolute value of the sum of the variable values. This matters only when values are negative.
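
To make the difference concrete, the two denominators can be compared by hand, without calling dist. The vectors below are hypothetical, and the names sum_of_abs and abs_of_sum are used only for this sketch.

```r
# Two vectors, one of which contains a negative value.
a <- c(1, -2)
b <- c(3, 4)

# Denominator is the sum of the absolute values, as in the method
# description on this page.
sum_of_abs <- sum(abs(a - b) / (abs(a) + abs(b)))

# Denominator is the absolute value of the sum, as described above
# for open-source R.
abs_of_sum <- sum(abs(a - b) / abs(a + b))

c(sum_of_abs = sum_of_abs, abs_of_sum = abs_of_sum)

# With non-negative values the two definitions agree.
a2 <- abs(a)
b2 <- abs(b)
all.equal(sum(abs(a2 - b2) / (abs(a2) + abs(b2))),
          sum(abs(a2 - b2) / abs(a2 + b2)))
```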
Note
If the columns of a matrix are in different units, we recommend that you scale the matrix before using dist. This is because a column that is much more variable than the others will dominate the distance measure.
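
The effect can be seen in a minimal sketch with two columns in very different units. The data values and column names below are made up for illustration.

```r
# Hypothetical data: height in metres and income in currency units.
x <- cbind(height_m = c(1.6, 1.7, 1.8, 1.9),
           income   = c(52000, 31000, 87000, 45000))

# Unscaled distances are almost entirely determined by income.
d_raw    <- dist(x)
d_income <- dist(x[, "income", drop = FALSE])
max(abs(as.vector(d_raw) - as.vector(d_income)))

# After scaling each column to mean 0 and standard deviation 1,
# both columns contribute on a comparable footing.
d_scaled <- dist(scale(x))
d_scaled
```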
References
Becker, R. A., Chambers, J. M., and Wilks, A. R. 1988. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books and Software.
Borg, I. and Groenen, P. 1997. Modern Multidimensional Scaling: Theory and Applications. New York, NY: Springer.
Everitt, B. 1980. Cluster Analysis. Second Edition. New York, NY: Halsted.
Mardia, K. V., Kent, J. T., and Bibby, J. M. 1979. Multivariate Analysis. London, UK: Academic Press.
See Also
cmdscale, hclust, as.matrix, format, labels, print, scale.
Examples
# create a sample object
x <- Sdatasets::votes.repub[13:23, 1:8]
dist(x, "max")  # distances among rows by maximum
dist(t(x))  # distances among cols in Euclidean metric
dist(x, "canberra", diag = TRUE)
ret <- dist(x, "minkowski", upper = TRUE, p = 3)
attributes(ret)
print(ret)
as.matrix(ret)
format(ret, digits = 3)
labels(ret)

as.dist(matrix(1:9, nrow = 3, dimnames = list(paste("R", 1:3, sep = "_"), paste("C", 1:3, sep = "_"))), diag = TRUE, upper = TRUE)
##     R_1 R_2 R_3
## R_1   0   2   3
## R_2   2   0   6
## R_3   3   6   0

Package stats version 6.0.0-69