Chapter 7. Unsupervised Learning at Scale

In the previous chapters, the focus was on predicting a target variable, which could be a number, a class, or a category. In this chapter, we change the approach and try to create new features and variables at scale, hopefully better suited to our prediction purposes than the ones already included in the observation matrix. We will first introduce unsupervised methods and then illustrate three of them that are able to scale to big data:

  • Principal Component Analysis (PCA), an effective way to reduce the number of features
  • K-means, a scalable algorithm for clustering
  • Latent Dirichlet Allocation (LDA), a very effective algorithm able to extract topics from a series of text documents
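
To make the clustering item in the list above more concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is an illustration under simplifying assumptions, not the implementation used later in the chapter: the function name `kmeans`, the toy `points` data, and the choice to initialize the centroids with the first `k` points are all hypothetical, and production libraries typically use smarter initialization and vectorized arithmetic.

```python
def kmeans(points, k, n_iter=100):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm.

    Assumption for this sketch: the first k points serve as the initial
    centroids, which keeps the example deterministic.
    """
    centroids = list(points[:k])
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Stop as soon as the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data, the two groups of three points end up in separate clusters. Each iteration costs time proportional to the number of points times the number of clusters, which is part of why the algorithm lends itself to large datasets.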

Unsupervised methods ...
