Datasets

Datasets extend DataFrames by providing type safety. This means that in the preceding example of the missing column, the Dataset API will throw a compile time error. In fact, DataFrames are actually an alias for Dataset[Row] where a Row is an untyped object that you may see in Spark applications written in Java. However, because R and Python have no compile-time type safety, this means that Datasets are not currently available to these languages.

There are numerous advantages to using the DataFrame and Dataset APIs over RDDs, including better performance and more efficient memory usage. The high-level APIs offered by DataFrame and Dataset also make it easier to perform standard operations such as filtering, grouping, and calculating ...

Get Machine Learning with Apache Spark Quick Start Guide now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.