
Hive, like other SQL databases, allows users to join various tables. Joins, however, can be computationally expensive, especially on big tables, and that's precisely why they deserve an entire tip of their own!

If you haven’t yet seen it, be sure to look at my Introduction to Hive article.

For the remainder of this post, we will use two tables in our example queries. Their DDL looks like the following:
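
A minimal sketch, assuming two tables t1 and t2 joined on an integer id column (the name and address columns are just placeholders for this post):

CREATE TABLE t1 (id INT, name STRING);
CREATE TABLE t2 (id INT, address STRING);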

A join query on the above tables would look like:
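
Again as a sketch, using the placeholder columns from the assumed DDL above:

SELECT t1.id, t1.name, t2.address
FROM t1
JOIN t2
ON (t1.id = t2.id);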

We will now enumerate some of the best practices to use when performing joins in Hive.

Only use equalities in join conditions

Hive only supports equality joins. This is primarily because it’s rather difficult to express non-equality join conditions in MapReduce.
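
As a quick illustration with the placeholder tables, the first condition below is an equi-join and works; a non-equality condition is not accepted:

-- Supported: equality join condition
SELECT t1.id, t2.address
FROM t1
JOIN t2
ON (t1.id = t2.id);
-- Not supported: a non-equality condition such as ON (t1.id < t2.id)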

Largest table last

The earlier example shows an inner join between two tables. Given that inner joins are commutative, though, should the user put t1 on the left and t2 on the right, or vice versa? The answer lies in understanding how Hive compiles a join query to MapReduce. In the MapReduce job for a regular inner join, mappers run over both tables, evaluating any UDFs in the query, filtering out records based on the where clause, and emitting the records that need to be joined. The shuffle phase then "shuffles" the records based on the join key (id in the above example). Subsequently, in the reduce phase, essentially a cross-product takes place between the records from each table that share the same join key. During this reduce phase, records from all tables except the last table in the join query are buffered in memory, while records from the query's last table are streamed into the reducer. Consequently, in order to reduce the memory required by the reducers, it's recommended to put the largest table last to ensure its records are streamed into the reducers.
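
For example, if t2 were the larger of the two placeholder tables, it should be listed last so that its records are streamed rather than buffered:

SELECT t1.id, t1.name, t2.address
FROM t1   -- smaller table: its records are buffered in memory in the reducers
JOIN t2   -- largest table last: its records are streamed into the reducers
ON (t1.id = t2.id);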

Alternatively, you can specify a hint in your query about which table to stream. In the above example, if t1 is larger than t2, the query with a hint telling Hive to stream t1 would look like:
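
A sketch using the STREAMTABLE hint with the placeholder columns:

SELECT /*+ STREAMTABLE(t1) */ t1.id, t1.name, t2.address
FROM t1
JOIN t2
ON (t1.id = t2.id);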

Make use of map joins where possible

If one of the tables in the join query is small enough to fit in memory, we recommend using a map join. A map join is a special type of join where the smaller table is loaded into memory and the join is performed in the map phase of the MapReduce job. Since no reducers are required, map joins are much faster than regular joins.

In the above example, if t1 is small enough to fit in memory, the map join query would look like:
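
A sketch using the MAPJOIN hint described below, again with the placeholder columns:

SELECT /*+ MAPJOIN(t1) */ t1.id, t1.name, t2.address
FROM t1
JOIN t2
ON (t1.id = t2.id);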

Starting with Hive 0.7, there is no need to specify the query hint (i.e. /*+ MAPJOIN(t1) */) for map joins. Set the hive.auto.convert.join property to true in hive-site.xml or in your Hive CLI session, and Hive will use a conditional task to determine whether it is possible to run a map join instead of a regular join. If so, it runs a map join; if not, it runs a regular join.
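
For example, in a Hive CLI session:

SET hive.auto.convert.join=true;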

There are many topics related to Hive joins, like bucketed and sort-merge joins, left semi joins, and deciding what belongs in the "on" clause versus the "where" clause of an outer join query, which we haven't discussed in this blog post. To learn more about those, refer to the Hive wiki page.

In the next blog post, we will learn how to write and plug in custom logic into our Hive scripts!

Safari Books Online has the content you need

Below are some Hive books to help you develop applications, or you can check out all of the Hive books and training videos available from Safari Books Online. You can browse the content in preview mode or you can gain access to more information with a free trial or subscription to Safari Books Online.

Programming Hive introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem.
If your organization is looking for a storage solution to accommodate a virtually endless amount of data, this book will show you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. HBase: The Definitive Guide provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
Ready to unlock the power of your data? With Hadoop: The Definitive Guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You will also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Start your FREE 10-day trial to Safari Books Online

About this author

Mark Grover is a contributor to the Apache Hive project and an active respondent on Hive's mailing list and IRC channel. He is a section author of O'Reilly's book on Hive, Programming Hive. He works as a Software Developer at Cloudera and is also a contributor to the Apache Bigtop project.

Tags: Big Data, databases, Hive, join query, MapReduce
