Scala Data Analysis Cookbook by Arun Manivannan

Storing data as Parquet files

Parquet (https://parquet.apache.org/) is rapidly becoming the go-to data storage format in the world of big data because of the distinct advantages it offers:

  • It has a column-based representation of data. This is easiest to see in a picture:
    [Figure: Storing data as Parquet files]

    As the preceding figure shows, Parquet stores data in chunks of rows, say 100 rows at a time. In Parquet terms, these chunks are called RowGroups. Each RowGroup in turn contains chunks of columns (column chunks). A column chunk can hold more than a single unit of data for a particular column (as represented by the blue box in the first column). For example, ...
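The layout described above can be sketched in plain Scala. This is only a toy in-memory model, not Parquet's actual on-disk format or API: the `Student` record, the `RowGroup` case class, and the `toRowGroups` helper are hypothetical names introduced for illustration. It cuts a sequence of rows into groups of a fixed size and then pivots each group so that all the values of one column sit together.

```scala
// Toy model of the RowGroup / column chunk idea. Not the Parquet API.
object RowGroupSketch {
  // Hypothetical record type, used only for illustration.
  final case class Student(id: Int, name: String, score: Double)

  // One row group: one column chunk per field, each holding that
  // field's values for every row in the group.
  final case class RowGroup(
    ids: Vector[Int],
    names: Vector[String],
    scores: Vector[Double]
  )

  // Cut the rows into groups of `rowsPerGroup`, then pivot each group
  // from row-oriented records into column chunks.
  def toRowGroups(rows: Seq[Student], rowsPerGroup: Int): Vector[RowGroup] =
    rows.grouped(rowsPerGroup).map { group =>
      RowGroup(
        ids = group.map(_.id).toVector,
        names = group.map(_.name).toVector,
        scores = group.map(_.score).toVector
      )
    }.toVector

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Student(1, "Ana", 81.5),
      Student(2, "Ben", 74.0),
      Student(3, "Cho", 92.25)
    )
    // With 2 rows per group, the 3 rows end up in 2 row groups.
    toRowGroups(rows, rowsPerGroup = 2).zipWithIndex.foreach {
      case (group, index) => println(s"RowGroup $index: $group")
    }
  }
}
```

Because each column chunk holds values of a single column, a reader that needs only `scores` can, in this model, look at `group.scores` and ignore the other fields of every row group.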
