Chapter 4. Spark SQL

Spark SQL is a Spark module for processing structured data. This chapter is divided into the following recipes:

  • Understanding the Catalyst optimizer
  • Creating HiveContext
  • Inferring schema using case classes
  • Programmatically specifying the schema
  • Loading and saving data using the Parquet format
  • Loading and saving data using the JSON format
  • Loading and saving data from relational databases
  • Loading and saving data from an arbitrary source

Introduction

Spark can process data from various data sources, such as HDFS, Cassandra, HBase, and relational databases. Big data frameworks (unlike relational database systems) do not enforce a schema at write time. HDFS is a perfect example, where any arbitrary file is welcome during the ...
