Chapter 11

(Dis)similarity measures

11.1 Introduction

Exploring and exploiting similarity patterns in data is at the heart of the clustering task and is therefore inherent in all clustering algorithms, but not all of them adopt an explicit similarity measure to drive their operation. Those that do more often rely on dissimilarity measures, which typically take their minimum values for maximally similar instances. Both kinds of measure are functions that assign real values to pairs of instances from the domain, and clustering algorithms can use them in both the cluster formation and the cluster modeling processes. Such algorithms can be referred to as similarity-based or, perhaps more appropriately, dissimilarity-based clustering algorithms.
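As a concrete illustration of a function that assigns a real value to an instance pair, the following minimal Python sketch uses Euclidean distance. This measure is chosen here only as a familiar example and is an assumption, not something taken from the text above; the function name `euclidean_dissimilarity` and the sample instances are likewise hypothetical.

```python
import math


def euclidean_dissimilarity(x, y):
    """Return the Euclidean distance between two numeric instances.

    Assumes both instances are sequences of numbers with the same
    number of attributes (an illustrative simplification).
    """
    if len(x) != len(y):
        raise ValueError("instances must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical instances, each with two numeric attributes.
a = (1.0, 2.0)
b = (4.0, 6.0)

print(euclidean_dissimilarity(a, a))  # identical instances: 0.0
print(euclidean_dissimilarity(a, b))  # sqrt(3^2 + 4^2) = 5.0
```

The value is smallest (zero) when the two instances are identical and grows as they differ more, which is the sense in which a dissimilarity measure takes its minimum for maximum similarity.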

This chapter presents a selection of the most commonly used general-purpose similarity and dissimilarity measures for clustering. They provide the common background needed to present the most widely used dissimilarity-based clustering algorithms, which are described in detail in Chapters 12 and 13.
