There's more...

Take note of a few interesting points about Datasets:

  • Datasets use lazy evaluation
  • Datasets take advantage of the Spark SQL Catalyst optimizer
  • Datasets take advantage of Tungsten off-heap memory management
  • Plenty of systems will remain on pre-Spark 2.0 releases for the next two years, so for practical reasons you must still learn and master RDDs and DataFrames
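
The first two points can be observed directly. The following is a minimal sketch, not code from the recipe: the `Car` case class, its sample rows, and the `DatasetNotes` object name are hypothetical and exist only for illustration, and a local Spark 2.x session is assumed. The transformations build a plan without running anything, `explain(true)` prints the plans that Catalyst produced, and only the final action triggers execution.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration
case class Car(make: String, model: String, price: Double)

object DatasetNotes {
  def main(args: Array[String]): Unit = {
    // Assumes a local Spark 2.x installation; adjust master as needed
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DatasetNotes")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data
    val cars = Seq(
      Car("Tesla", "Model S", 71000.0),
      Car("Audi", "A3", 37000.0),
      Car("BMW", "330i", 43000.0)
    ).toDS()

    // Transformations are lazy: this line only builds a logical plan
    val expensiveMakes = cars.filter(_.price > 40000.0).map(_.make)

    // Print the parsed, analyzed, and optimized logical plans
    // along with the physical plan chosen by Catalyst
    expensiveMakes.explain(true)

    // An action such as collect() is what actually runs the job
    expensiveMakes.collect().foreach(println)

    spark.stop()
  }
}
```

Tungsten memory management is internal to the engine, so this sketch does not show it directly.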
