O'Reilly logo

Apache Spark for Data Science Cookbook by Padma Priya Chitturi

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Working with DataFrames

Spark SQL is a Spark module for structured data processing. It provides the programming abstraction called DataFrame (in earlier versions of Spark, it is called SchemaRDD) and also acts as distributed SQL query engine. The capabilities it provides are as follows:

  • It loads data from a variety of structured sources (for example, JSON, Hive, and Parquet)
  • It lets you query data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as BI tools like Tableau.
  • Spark SQL provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.

A DataFrame ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required