Summary

Hive, through its query language HiveQL, brings in SQL and Relational database concepts to Hadoop. The primary use case for Hive is data warehousing and analytical querying for applications such as Business Intelligence. The supporting components of Hive are built to assist this use case. For example, row-columnar file formats are very efficient when performing aggregations on columns.

The key takeaways from this chapter are as follows:

  • In Hive, a close look has to be kept on the file format used by the underlying table. Text files can be inefficient. Sequence files are better off as they are compressed. Specialized files such as RC and ORC are more suited both in terms of I/O and query performance.
  • Compression brings in efficiency. Both ...

Get Hadoop: Data Processing and Modelling now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.