Top K statistics in Hive

Top K statistics is a mechanism for collecting the K most frequent values of a column in a Hive table. The top K values of the most skewed column are stored in the partition. It applies to both existing and newly created tables.

How to do it…

Top K statistics computation is disabled by default. The following are some of the properties that can be set to compute and store top K statistics:

  • hive.stats.topk.collect
    • Enables computing the top K values and storing them in the skewed information
    • Default Value: false
    • Valid Values: true, false
  • hive.stats.topk.num
    • Specifies the value of K for the top K result
  • hive.stats.topk.minpercent
    • The minimum percentage of rows that a value must account for to be included in the top K result
    • It can be any float value between ...
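
To make the roles of K and the minimum percentage concrete, the following Python sketch picks the most frequent values of a column and drops any value that falls below a percentage threshold. It is only an illustration of the idea, not Hive's implementation: the function name top_k_values, its parameters, the sample data, and the 0 to 100 scale used for min_percent are all assumptions made for this example.

```python
from collections import Counter


def top_k_values(values, k, min_percent=0.0):
    """Return up to k (value, count, percent) tuples, most frequent first.

    min_percent uses a 0-100 scale here; that scale is an assumption of
    this sketch, not something taken from Hive.
    """
    total = len(values)
    if total == 0 or k <= 0:
        return []

    result = []
    # most_common() yields (value, count) pairs sorted by count, descending.
    for value, count in Counter(values).most_common():
        percent = 100.0 * count / total
        if percent < min_percent:
            # Every remaining value is no more frequent, so stop here.
            break
        result.append((value, count, percent))
        if len(result) == k:
            break
    return result


# Hypothetical column values: "US" dominates, so the column is skewed.
rows = ["US", "US", "US", "US", "US", "US", "IN", "IN", "IN", "DE"]
top = top_k_values(rows, k=2, min_percent=20.0)
for value, count, percent in top:
    print(value, count, percent)
```

With this sample data, asking for k=3 returns the same two values as k=2, because "DE" accounts for only 10 percent of the rows and falls below the 20 percent threshold.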
