# Clustering

Another important data mining technique is clustering. Clustering is a way to find groups of similar observations in a data set; these groups are called clusters. R provides several functions for clustering.

## Distance Measures

To effectively use clustering algorithms, you need to begin by measuring the distance between observations. A convenient way to do this in R is through the function `dist` in the `stats` package:

`dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)`
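As a minimal sketch of that call, the example below builds a small matrix and computes the default Euclidean distances between its rows. The matrix `pts` and its row and column names are made up for illustration and are not from the text.

```r
# Hypothetical data: three observations (rows) measured on two variables.
pts <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("x", "y")))

# Euclidean distance between every pair of rows (the default method).
d <- dist(pts)
print(d)

# Convert to an ordinary symmetric matrix to look up individual pairs.
m <- as.matrix(d)
m["a", "b"]  # sqrt(3^2 + 4^2) = 5
```

Only the lower triangle is stored in `d`, so three observations yield three distances; `as.matrix` expands them into a full 3 x 3 matrix with zeros on the diagonal.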

The `dist` function computes the distance between each pair of rows in a matrix or data frame. It returns a distance matrix (an object of class `dist`) containing the computed distances. Here is a description of the arguments to `dist`.

| Argument | Description | Default |
| --- | --- | --- |
| x | The object on which to compute distances. Must be a data frame, matrix, or `dist` object. | |
| method | The method for computing distances. Specify `method="euclidean"` for Euclidean distances (2-norm), `method="maximum"` for the maximum distance between observations (supremum norm), `method="manhattan"` for the absolute distance between two vectors (1-norm), `method="canberra"` for Canberra distances (see the help file), `method="binary"` to regard nonzero values as 1 and zeros as 0, or `method="minkowski"` to use the p-norm (the pth root of the sum of the pth powers of the differences of the components). | `"euclidean"` |
| diag | A logical value specifying whether the diagonal of the distance matrix should be printed by `print.dist` ... | |
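To make the `method` options more concrete, here is a small sketch that applies several of them to the same pair of observations. The matrix `x` and its row names `p` and `q` are hypothetical, chosen so the results can be checked by hand.

```r
# Hypothetical data: two observations on three variables.
# The component-wise absolute differences are 3, 4, and 0.
x <- rbind(p = c(1, 0, 5),
           q = c(4, 4, 5))

d_euc <- dist(x, method = "euclidean")         # sqrt(3^2 + 4^2 + 0^2) = 5
d_max <- dist(x, method = "maximum")           # max(3, 4, 0) = 4
d_man <- dist(x, method = "manhattan")         # 3 + 4 + 0 = 7
d_mnk <- dist(x, method = "minkowski", p = 3)  # (3^3 + 4^3 + 0^3)^(1/3)

# Collect the single pairwise distance from each method for comparison.
c(euclidean = as.numeric(d_euc),
  maximum   = as.numeric(d_max),
  manhattan = as.numeric(d_man),
  minkowski = as.numeric(d_mnk))
```

With `p = 2`, the Minkowski distance coincides with the Euclidean distance, which is why `p` only matters when `method = "minkowski"`.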
