Datasets

Apache Spark Datasets are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. The API was first introduced in the 1.6 release, and Spark 2.0 unified the DataFrame and Dataset APIs. A DataFrame is now a generic, untyped Dataset (a Dataset of Row objects); seen the other way, a Dataset is a DataFrame with added structure. The term "structure" here refers to a pattern or organization of the underlying data, much like a table schema in RDBMS parlance. The structure limits what can be expressed or contained in the underlying data, which in turn enables better optimizations in memory organization as well as in physical execution. Compile-time type checking catches errors earlier than during ...
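The difference between the typed and untyped views can be sketched in Scala as follows. This is a minimal illustration, not code from the book: the `Person` case class, the sample rows, the application name, and the local master setting are all assumptions made for the example, and it presumes Spark 2.0 or later on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
// Defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Long)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the sketch.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings in encoders, toDS(), and the $"col" syntax

    // A typed Dataset: every element is a Person, and the schema
    // (name: string, age: long) is derived from the case class.
    val people: Dataset[Person] =
      Seq(Person("Ann", 34L), Person("Bob", 19L)).toDS()

    // Typed transformation: the compiler checks that `age` exists on Person.
    val adults: Dataset[Person] = people.filter(p => p.age >= 21)
    // people.filter(p => p.agee >= 21)   // would be rejected at compile time

    // Untyped view: since Spark 2.0, DataFrame is an alias for Dataset[Row].
    val df: DataFrame = people.toDF()
    val adultsDf: DataFrame = df.filter($"age" >= 21)
    // df.filter($"agee" >= 21)           // compiles, but fails at run time
    //                                    // with an AnalysisException

    // Going back to the typed view re-attaches the structure.
    val typedAgain: Dataset[Person] = adultsDf.as[Person]

    adults.show()
    typedAgain.show()
    spark.stop()
  }
}
```

The two commented-out lines carry the main point: a misspelled field is caught by the compiler in the typed API, while the same mistake in the untyped API surfaces only when Spark analyzes the query.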
