Working with Parquet

In this section, we discuss the operations that Spark SQL provides for working with the Parquet data format, with appropriate examples.

Parquet is a popular columnar storage format for structured data. To represent nested structures, Parquet uses the record shredding and assembly algorithm (http://tinyurl.com/p8kaawg) described in the Dremel paper (http://research.google.com/pubs/pub36632.html), an approach that works better than simply flattening structured tables. Parquet also supports efficient compression and encoding schemes. Refer to https://parquet.apache.org/ for more information on the Parquet data format.
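
To see why a columnar layout pairs well with compression and encoding, consider the following minimal sketch in plain Python. It is an illustration only and does not reproduce Parquet's actual on-disk format: the sample records, the field names, and the helper functions to_columns and run_length_encode are all hypothetical. The sketch pivots row-oriented records into columns and then applies a simple run-length encoding, which is one kind of scheme that benefits from having similar values stored next to each other.

```python
from itertools import groupby

# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"country": "US", "year": 2015, "sales": 10},
    {"country": "US", "year": 2015, "sales": 12},
    {"country": "US", "year": 2016, "sales": 9},
    {"country": "IN", "year": 2016, "sales": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse each run of consecutive equal values into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Column-oriented view: all values of one field are stored together.
columns = to_columns(rows)

# Encode every column independently, as a columnar layout allows.
encoded = {name: run_length_encode(values) for name, values in columns.items()}

print(encoded["country"])  # [('US', 3), ('IN', 1)]
print(encoded["year"])     # [(2015, 2), (2016, 2)]
print(encoded["sales"])    # [(10, 1), (12, 1), (9, 1), (7, 1)]
```

Columns with many repeated values, such as country and year here, shrink under this encoding, while a column of distinct values such as sales does not. Choosing an encoding per column is possible only because each column's values are stored contiguously.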

The DataFrame API of Spark SQL provides convenience operations for writing and reading data in the Parquet ...
