Spark SQL is a Spark module for structured data processing. It provides the programming abstraction called DataFrame (in earlier versions of Spark, it is called SchemaRDD) and also acts as distributed SQL query engine. The capabilities it provides are as follows:
- It loads data from a variety of structured sources (for example, JSON, Hive, and Parquet)
- It lets you query data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as BI tools like Tableau.
- Spark SQL provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
A DataFrame ...