PCA in Apache Spark

Let's now return to our transformed, pipe-delimited user-community movie ratings dataset, movie-ratings-data/user-movie-ratings.csv, which contains ratings by 300 users covering 3,000 movies. We will develop an Apache Spark application that uses PCA to reduce the dimensionality of this dataset while preserving its structure. To do this, we will go through the following steps:

  1. First, let's load the transformed, pipe-delimited user-community movie ratings dataset into a Spark ...

The following subsections describe each of the pertinent cells in the corresponding Jupyter notebook for this use case, called chp05-02-principal-component-analysis.ipynb. This can be found in the GitHub repository accompanying this book.
