DataFrames

Like RDDs, they are an immutable and distributed collection of records partitioned across the worker nodes in a Spark cluster. However, unlike RDDs, data is organized into named columns conceptually similar to tables in relational databases and tabular data structures found in other programming languages, such as Python and R. Because DataFrames offer the ability to impose a schema on distributed data, they are more easily exposed to more familiar programming languages such as SQL, which makes them a popular and arguably easier data structure to work with and manipulate, as well as being more efficient than RDDs.

The main disadvantage of DataFrames however is that, similar to Spark SQL string queries, analysis errors are only caught ...

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.