Dimensionality reduction with principal component analysis

Dimensionality reduction is the process of reducing the number of dimensions, or features, in a dataset. Real-world data often has a very large number of features; thousands are not uncommon. The goal is to narrow these down to the features that matter.

Dimensionality reduction serves several purposes, including:

  • Data compression
  • Visualization

Reducing the number of dimensions shrinks both the disk footprint and the memory footprint of the data. It also helps algorithms run much faster, and it can collapse highly correlated dimensions into one.
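To make the idea of collapsing correlated dimensions concrete, here is a minimal sketch in plain Python (standard library only, not Spark's API). It assumes a tiny, made-up two-feature dataset in which the second feature is roughly twice the first, and it uses the closed-form principal axis of a 2x2 covariance matrix, which only works for two features. The function names and data are hypothetical and chosen for illustration.

```python
import math


def first_principal_axis(points):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed form for a symmetric 2x2 matrix: the angle of the eigenvector
    # with the largest eigenvalue (the direction of greatest variance).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def project(points, mean, direction):
    """Project each 2-D point onto the principal axis, giving one number per point."""
    return [
        (x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
        for x, y in points
    ]


# Hypothetical data: the second feature is roughly 2x the first.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

mean, direction = first_principal_axis(data)
reduced = project(data, mean, direction)  # five 1-D values instead of five 2-D points
```

Each original point is now represented by a single coordinate along the direction of greatest variance, which is how two highly correlated features can be stored as one with little loss of information.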

Humans can only visualize three dimensions, but data can have a much higher number of dimensions. Visualization can help find hidden patterns in the ...
