DataFrame - a natural evolution to unite API and SQL via a high-level API

The Spark developer community has always strived to provide an easy-to-use, high-level API, starting from the AMPLab days at Berkeley. The next evolution in the data API materialized when Michael Armbrust gave the community Spark SQL and the Catalyst optimizer, which made data virtualization possible with Spark using a simple and well-understood SQL interface. The DataFrame API was a natural evolution to take advantage of Spark SQL by organizing data into named columns, like relational tables.

The DataFrame API made data wrangling via SQL available to a multitude of data scientists and developers familiar with data frames in R (data.frame) or Python (pandas) ...
