Data file optimization

Data file optimization covers performance improvements to data files in terms of file format, compression, and storage.

File format

Hive supports TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. The three ways to specify the file format are as follows:

  • CREATE TABLE ... STORED AS <File_Format>
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT <File_Format>
  • SET hive.default.fileformat=<File_Format> --default fileformat for table

Here, <File_Format> is TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.
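
As a minimal sketch, the three options could look like the following in HiveQL. The table names (sales_orc, sales_log), the columns, and the partition value are hypothetical and chosen only for illustration:

```sql
-- 1. Specify the format when the table is created.
CREATE TABLE sales_orc (
  item   STRING,
  amount DOUBLE
)
STORED AS ORC;

-- 2. Change the format recorded for an existing table or for a single
--    partition. This assumes sales_log already exists and is partitioned
--    by a string column named dt.
ALTER TABLE sales_log SET FILEFORMAT PARQUET;
ALTER TABLE sales_log PARTITION (dt='2018-01-01') SET FILEFORMAT PARQUET;

-- 3. Change the session default, used by later CREATE TABLE statements
--    that have no STORED AS clause.
SET hive.default.fileformat=ORC;
```

Note that ALTER TABLE ... SET FILEFORMAT only updates the table metadata. Files already stored in the table are not converted, so existing data has to be rewritten separately.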

We can load a text file directly into a table with the TEXTFILE format. To load data into a table with another file format, we first need to load the data into a TEXTFILE format table. Then, use INSERT OVERWRITE ...
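
The two-step load described above might be sketched as follows. The table names, the columns, the comma delimiter, and the input path are all assumptions made for this example, not values from the text:

```sql
-- Staging table in TEXTFILE format; the comma delimiter is an assumption
-- about the layout of the input file.
CREATE TABLE employee_staging (
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the raw text file into the staging table (hypothetical path).
LOAD DATA LOCAL INPATH '/tmp/employee.csv'
OVERWRITE INTO TABLE employee_staging;

-- Target table in a non-text format.
CREATE TABLE employee_orc (
  name   STRING,
  salary DOUBLE
)
STORED AS ORC;

-- Rewrite the staged rows into the target table; Hive writes them
-- in the target table's format (ORC here).
INSERT OVERWRITE TABLE employee_orc
SELECT name, salary
FROM employee_staging;
```

The staging step is needed because LOAD DATA only moves or copies files into the table's location without converting them, whereas INSERT OVERWRITE ... SELECT writes the rows out in the target table's format.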
