Chapter 9. Unsupervised Learning with MLlib

This chapter covers unsupervised learning with MLlib, Spark's machine learning library.

This chapter is divided into the following recipes:

  • Clustering using k-means
  • Dimensionality reduction with principal component analysis
  • Dimensionality reduction with singular value decomposition

Introduction

The following is Wikipedia's definition of unsupervised learning:

"In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data."

In contrast to supervised learning, where we have labeled data to train an algorithm, in unsupervised learning we ask the algorithm to find structure on its own. Let's take a look at the following sample dataset: ...
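
To make the idea of finding structure without labels concrete, here is a minimal sketch of k-means clustering, the technique covered in this chapter's first recipe. It is plain Python rather than MLlib code, so it does not show MLlib's API. The `kmeans` helper, the first-k-points seeding, and the six sample points are all assumptions made for illustration, and the points are not the sample dataset mentioned above.

```python
import math

def kmeans(points, k, iterations=10):
    """Toy k-means (Lloyd's algorithm): alternate assignment and update steps."""
    # Assumption: seed the centroids with the first k points.
    # This keeps the sketch deterministic; it is not a robust initialization.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments

# Hypothetical unlabeled data: no point carries a class label.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]

centroids, assignments = kmeans(points, k=2)
print(assignments)  # cluster index per point
print(centroids)    # one centroid per discovered group
```

No labels are supplied anywhere in the input; the grouping comes only from the distances between points, which is what separates this from a supervised setup.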
