The RDD API

An RDD (Resilient Distributed Dataset) is a read-only, partitioned, fault-tolerant collection of records. It was designed as a single data structure abstraction that hides the complexity of dealing with a wide variety of data sources, whether HDFS, other filesystems, an RDBMS, a NoSQL store, or anything else. Users should be able to define an RDD from any of these sources, apply a wide array of operations to it, and compose those operations in any order.
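To make the three properties concrete, here is a minimal toy model in plain Python. It is not Spark's implementation, and `ToyRDD` and all of its methods are hypothetical names invented for this sketch. It assumes that a partition can be described by a function that rebuilds it from its source, which is one simple way to illustrate how a lost partition could be recomputed rather than restored from a replica.

```python
from typing import Callable, Iterator, List, Sequence


class ToyRDD:
    """Toy model of a read-only, partitioned collection (NOT Spark's RDD)."""

    def __init__(self, num_partitions: int, compute: Callable[[int], Iterator]):
        self._num_partitions = num_partitions
        # Recipe for (re)building partition i; nothing is materialized up front.
        self._compute = compute

    @classmethod
    def from_list(cls, data: Sequence, num_partitions: int) -> "ToyRDD":
        frozen = tuple(data)  # snapshot, so the records cannot change afterwards

        def compute(i: int) -> Iterator:
            # Partition i holds every num_partitions-th record, starting at index i.
            return iter(frozen[i::num_partitions])

        return cls(num_partitions, compute)

    def map(self, f: Callable) -> "ToyRDD":
        # Read-only: return a new ToyRDD instead of modifying this one.
        parent = self
        return ToyRDD(self._num_partitions,
                      lambda i: (f(x) for x in parent._compute(i)))

    def filter(self, pred: Callable) -> "ToyRDD":
        parent = self
        return ToyRDD(self._num_partitions,
                      lambda i: (x for x in parent._compute(i) if pred(x)))

    def partition(self, i: int) -> List:
        # Rebuilds partition i from the source each time it is called.
        return list(self._compute(i))

    def collect(self) -> List:
        out: List = []
        for i in range(self._num_partitions):
            out.extend(self.partition(i))
        return out


numbers = ToyRDD.from_list(range(10), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(sorted(even_squares.collect()))  # [0, 4, 16, 36, 64]
```

Each operation returns a new object that remembers how it was derived from its parent, so operations compose freely and any single partition can be rebuilt on demand by calling `partition(i)` again.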

RDD basics

In Spark's programming interface, each dataset is represented as an object called an RDD. Spark provides two ways to create RDDs: parallelizing an existing collection, or referencing a dataset in an external storage system such as ...
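The two creation paths can be sketched in PySpark roughly as follows. This assumes PySpark is installed and that a local master is acceptable. The helper names `create_rdds` and `main`, the `local[2]` master, the application name, and the temporary file `lines.txt` are all illustrative choices, not taken from the text. A local text file stands in for the external storage system.

```python
import os
import tempfile


def create_rdds(sc, values, path, num_slices=4):
    """Create one RDD from a collection and one from external storage.

    `sc` is assumed to be a pyspark.SparkContext.
    """
    # Way 1: distribute an existing driver-side collection across partitions.
    from_collection = sc.parallelize(values, num_slices)
    # Way 2: reference a dataset in external storage; each line becomes a record.
    from_storage = sc.textFile(path)
    return from_collection, from_storage


def main():
    # Imported here so the helper above can be read without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-creation-sketch")
    try:
        # A local temporary file stands in for the external storage system.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "lines.txt")
            with open(path, "w") as f:
                f.write("alpha\nbeta\ngamma\n")
            numbers, lines = create_rdds(sc, [1, 2, 3, 4, 5], path)
            print(numbers.count(), lines.count())
    finally:
        sc.stop()
```

Calling `main()` starts a local Spark context, builds both RDDs, and prints their record counts. The context is stopped in the `finally` block so that a failed run does not leave it open.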
