Summary

In this section, we learned how to identify groups of similar items in a dataset, an exploratory analysis that we might frequently use as a first step in deciphering new datasets. We explored different ways of calculating the similarity between datapoints and described what kinds of data these metrics might best apply to. We examined both divisive clustering algorithms, which split the data into smaller components starting from a single group, and agglomerative methods, where every datapoint starts as its own cluster. Using a number of datasets, we showed examples where these algorithms will perform better or worse, and some ways to optimize them. We also saw our first (small) data pipeline, a clustering application in PySpark using streaming ...

Get Mastering Predictive Analytics with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.