Working with DataFrames

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called the DataFrame (in earlier versions of Spark, it was called SchemaRDD) and also acts as a distributed SQL query engine. Its capabilities are as follows:

  • It loads data from a variety of structured sources (for example, JSON, Hive, and Parquet).
  • It lets you query data using SQL, both inside a Spark program and from external tools, such as BI tools like Tableau, that connect to Spark SQL through standard database connectors (JDBC/ODBC).
  • It provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
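
The capabilities listed above can be sketched in a few lines of Scala. This is a minimal illustration rather than the book's own recipe: it assumes Spark 2.2 or later (for the SparkSession entry point and for reading JSON from an in-memory Dataset), and the object name, the sample records, the view names, and the shout function are all hypothetical. In a real program, the JSON would more likely be loaded from a file path with spark.read.json.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; all names and data here are illustrative assumptions.
object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumes Spark 2.2+ on the classpath).
    val spark = SparkSession.builder()
      .appName("DataFrameSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1. Load structured data: JSON records parsed into a DataFrame.
    //    In-memory strings keep the sketch self-contained; a file path works the same way.
    val jsonLines = Seq(
      """{"name": "Alice", "age": 34}""",
      """{"name": "Bob", "age": 15}"""
    ).toDS()
    val people = spark.read.json(jsonLines)

    // 2. Query with SQL from inside the program by registering a temporary view.
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // 3. Join an ordinary RDD with the SQL result by converting the RDD to a DataFrame.
    val cityRdd = spark.sparkContext.parallelize(Seq(("Alice", "Paris"), ("Bob", "Oslo")))
    val cities = cityRdd.toDF("name", "city")
    val joined = adults.join(cities, "name")

    // 4. Expose a custom Scala function to SQL as a user-defined function.
    spark.udf.register("shout", (s: String) => s.toUpperCase)
    joined.createOrReplaceTempView("adults_with_city")
    spark.sql("SELECT shout(name) AS name, city FROM adults_with_city").show()

    spark.stop()
  }
}
```

Access from external tools over JDBC/ODBC is not shown here, since it depends on running a separate server process rather than on application code.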

A DataFrame ...
