Skew join

When working with data that has a highly uneven distribution, data skew can occur in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings tell Hive to optimize the join properly when data skew happens:

> SET hive.optimize.skewjoin=true;
--If there is data skew in join, set it to true. Default is false.
> SET hive.skewjoin.key=100000;
--This is the default value. If the number of rows with the same key is bigger than
--this, the new keys will be sent to the other unused reducers.
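As a rough illustration of how these settings are used in practice, the sketch below enables them for a session and then runs an ordinary join. The table and column names (page_views, users, user_id, url, country) are hypothetical and only stand in for a large table in which a few join key values dominate.

```sql
-- Enable skew join optimization for the current session.
SET hive.optimize.skewjoin=true;
-- Row-count threshold per join key; 100000 is the default.
SET hive.skewjoin.key=100000;

-- Hypothetical tables: page_views is large and a few user_id values
-- account for most of its rows; users is a smaller lookup table.
SELECT pv.url, u.country
FROM page_views pv
JOIN users u
  ON pv.user_id = u.user_id;
```

The query itself does not change; only the session settings differ, so the same statement can be run with and without them to compare job behavior.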
Skewed data can occur with GROUP BY too. To optimize it, we need to set hive.groupby.skewindata=true, in addition to the preceding settings, to enable skew data optimization in the GROUP BY result. Once configured, Hive will ...
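
A minimal sketch of applying the GROUP BY setting, assuming the same hypothetical page_views table with a heavily skewed user_id column:

```sql
-- Enable skew handling for GROUP BY in the current session.
SET hive.groupby.skewindata=true;

-- Hypothetical table and column: a few user_id values account for
-- most of the rows, so the aggregation would otherwise be uneven.
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;
```

As with the join settings, this is set per session, so it can be turned on only for queries known to aggregate over skewed keys.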
