Creating DataFrames from Scala data structures

In this recipe, we explore the DataFrame API, which provides a higher level of abstraction than RDDs for working with data. The API is similar to the data frame facilities in R and in Python (pandas).

The DataFrame API simplifies coding and lets you use standard SQL to retrieve and manipulate data. Spark keeps additional metadata about DataFrames, which helps the API manipulate them with ease. Every DataFrame has a schema (either inferred from the data or explicitly defined), which allows us to view the frame like a SQL table. The secret sauce of Spark SQL and DataFrame is the Catalyst optimizer, which works behind the scenes to optimize access by rearranging calls in the pipeline.
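The following is a minimal sketch of the idea, assuming a Spark 2.x local SparkSession; the CarSale case class, its fields, and the sample values are illustrative, not part of the recipe's dataset. It builds a DataFrame from a Scala Seq, lets Spark infer the schema from the case class fields, and then queries the frame with standard SQL through a temporary view:

import org.apache.spark.sql.SparkSession

// Hypothetical case class; its field names become the column names
// in the inferred schema
case class CarSale(make: String, model: String, price: Double)

object DataFrameFromScala {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameFromScala")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // enables toDF() on Scala collections

    // Build a DataFrame directly from a Scala Seq; the schema is
    // inferred from the case class via reflection
    val sales = Seq(
      CarSale("Ford", "Focus", 18500.0),
      CarSale("Tesla", "Model 3", 42000.0),
      CarSale("Toyota", "Corolla", 21000.0)
    ).toDF()

    sales.printSchema()   // make: string, model: string, price: double

    // Register the DataFrame as a temporary view so it can be
    // queried with standard SQL
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT make, model FROM sales WHERE price > 20000").show()

    spark.stop()
  }
}

Because the query goes through Spark SQL, the Catalyst optimizer is free to push the price filter down and prune unused columns before the data is materialized.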
