Skew join

When data has a highly uneven distribution, data skew can leave a small number of compute nodes handling the bulk of the computation. The following settings tell Hive to optimize the join when data skew occurs:

> SET hive.optimize.skewjoin=true; --If there is data skew in join, set it to true. Default is false.
> SET hive.skewjoin.key=100000;
 --This is the default value. If the number of rows with the same join key
 --exceeds this, the key is treated as skewed and its additional rows are
 --sent to other, unused reducers.
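
As a minimal sketch of how these settings might be applied, assume two hypothetical tables, clicks and users, joined on user_id, where a few user_id values account for most of the rows in clicks. The table and column names are illustrative only and do not come from the book's examples:

> SET hive.optimize.skewjoin=true;
> SET hive.skewjoin.key=100000;
> --clicks and users are hypothetical tables used only for illustration
> SELECT c.user_id, u.name, c.url
> FROM clicks c
> JOIN users u ON c.user_id = u.user_id;

The query itself is unchanged. The settings only affect how Hive plans and runs the join, so they can be switched on for a single session without rewriting existing queries.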

Skewed data can occur in GROUP BY operations too. To optimize this case, set hive.groupby.skewindata=true to enable skew data optimization for the GROUP BY result. Once configured, Hive will ...
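
A minimal sketch of the GROUP BY case, assuming a hypothetical clicks table in which a few user_id values dominate the row count (the table and column names are illustrative, not from the book):

> SET hive.groupby.skewindata=true;
> --clicks is a hypothetical table used only for illustration
> SELECT user_id, count(*) AS click_cnt
> FROM clicks
> GROUP BY user_id;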
