
The term *clustering* refers to the process of finding groups
of points within a data set that are in some way
“lumped together.” It is also called

I regard clustering as an *exploratory* method:
a computer-assisted (or even computationally driven) approach to
discovering structure in a data set. As an exploratory technique, it
usually needs to be followed by a confirmatory analysis that validates
the findings and makes them more precise.

Clustering is a lot of fun. It is a rich topic with a wide variety
of different problems, as we will see in the next section, where we
discuss the different *kinds* of clusters one may
encounter. The topic also has a lot of intuitive appeal, and most
clustering methods are rather straightforward. This allows for all sorts
of ad hoc modifications and enhancements to accommodate the specific
problem one is working on.

Clustering is not a very rigorous field: there are precious few established results, rigorous theorems, or algorithmic guarantees. In fact, the whole notion of a “cluster” is not particularly well defined. Descriptions such as “groups of points that are similar” or “close to each other” are insufficient, because clusters must ...