Spark RDD and dataframes

In this section, we turn to data and to how Apache Spark represents and organizes it. We introduce the Apache Spark RDD and Apache Spark dataframes.

After this section, readers should have a working grasp of these two fundamental Spark concepts, the RDD and the Spark dataframe, and be ready to use them in machine learning projects.

Spark RDD

Apache Spark's primary data abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). The RDD is Apache Spark's key innovation, and it is what makes Spark's computing faster and more efficient than that of comparable frameworks.

Specifically, an RDD is an immutable collection of objects that is spread across a cluster. It is statically typed, for example ...
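
The two properties named above, immutability and a collection split across a cluster, can be pictured with a small plain-Python model. This is only a sketch and does not use Spark itself: the class ToyRDD, its methods, and the sample words are hypothetical names invented for illustration, and the "cluster" is simulated by a tuple of partitions held in one process.

```python
from dataclasses import dataclass
from typing import Generic, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ToyRDD(Generic[T]):
    """Hypothetical stand-in for an RDD; not part of the Spark API."""

    # Each inner tuple plays the role of one partition held by one node.
    partitions: Tuple[Tuple[T, ...], ...]

    @staticmethod
    def from_items(items, num_partitions):
        """Split items into num_partitions contiguous, immutable chunks."""
        items = list(items)
        size = -(-len(items) // num_partitions)  # ceiling division
        parts = tuple(
            tuple(items[i * size:(i + 1) * size]) for i in range(num_partitions)
        )
        return ToyRDD(parts)

    def map(self, fn):
        """Apply fn partition by partition and return a NEW ToyRDD."""
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self):
        """Gather every partition back into one local list."""
        return [x for part in self.partitions for x in part]


# Assumed sample data: four words spread over two simulated partitions.
words = ToyRDD.from_items(["spark", "rdd", "data", "frame"], num_partitions=2)

# A transformation never changes `words`; it produces a new collection.
lengths = words.map(len)

print(words.collect())    # ['spark', 'rdd', 'data', 'frame']
print(lengths.collect())  # [5, 3, 4, 5]
```

Because the dataclass is frozen, an attempt to reassign words.partitions raises an error, which mirrors the idea that an RDD is never modified in place and that new RDDs are derived from existing ones.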
