Dataset - a high-level unifying Data API

A dataset is an immutable collection of objects which are modelled/mapped to a traditional relational schema. There are four attributes that distinguish it as the preferred method going forward. We particularly find the Dataset API appealing since we find it familiar to RDDs with the usual transformational operators (for example, filter(), map(), flatMap(), and so on). The Dataset will follow a lazy execution paradigm similar to RDD. The best way to try to reconcile DataFrames and Datasets is to think of a DataFrame as an alias that can be thought of as Dataset[Row].

  • Strong type safety: We now have both compile-time (syntax errors) and runtime safety in a unified Data API, which helps the ML developer ...

Get Apache Spark 2.x Machine Learning Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.