Chapter 11

(Dis)similarity measures

11.1 Introduction

While exploring and exploiting similarity patterns in data is at the heart of the clustering task, and therefore inherent in all clustering algorithms, not all of them adopt an explicit similarity measure to drive their operation. Similarity measures, or more often dissimilarity measures (which typically take their minimum values for maximally similar instances), are functions that assign real values to pairs of instances from the domain. Clustering algorithms can use them in both the cluster formation and the cluster modeling processes. Such algorithms can be referred to as similarity-based or, perhaps more appropriately, dissimilarity-based clustering algorithms.
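To make the idea of a function that assigns a real value to an instance pair concrete, the following is a minimal sketch in R. It uses the Euclidean distance as one familiar example of a dissimilarity measure. The function name euclid_dissim and the three instances are made up for illustration and do not come from the dmr packages or the weathercl data.

```r
# Hypothetical helper (an assumption for illustration): Euclidean distance
# between two instances, each given as a numeric vector of attribute values.
euclid_dissim <- function(x1, x2) {
  stopifnot(is.numeric(x1), is.numeric(x2), length(x1) == length(x2))
  sqrt(sum((x1 - x2)^2))
}

# Three made-up instances described by two numeric attributes.
a  <- c(1, 2)
b  <- c(1, 2)
c3 <- c(4, 6)

euclid_dissim(a, b)   # identical instances give 0, the minimum value
euclid_dissim(a, c3)  # sqrt(3^2 + 4^2) = 5
```

The value is smallest for the most similar pair, which is what distinguishes a dissimilarity measure from a similarity measure.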

This chapter presents a selection of the most commonly used general-purpose similarity and dissimilarity measures for clustering, providing a necessary common background for presenting the most widely used dissimilarity-based clustering algorithms. The latter are described in detail in Chapters 12 and 13.

Example 11.1.1

Each dissimilarity measure presented in this chapter will be illustrated with a simple R implementation, applied to the weathercl data. Utility functions from the dmr.util package and the standardization implementation available in the dmr.trans package will also be used. The R code presented below loads the packages and the dataset.

Data Mining Algorithms: Explained Using R by Pawel Cichosz
