Column statistics in Hive

As with table and partition statistics, Hive also supports the analysis of column statistics. Hive captures the following statistics when a column or a set of columns is analyzed:

  • The number of distinct values
  • The number of NULL values
  • Minimum or maximum K values, where K can be specified by the user
  • Histograms: frequency-based and height-balanced
  • Average size of the column
  • Average or sum of all values in the column, if the column type is numeric
  • Percentiles of the values

How to do it…

As discussed in the previous recipe, Hive provides the analyze command to compute table or partition statistics. The same command can be used to compute statistics for one or more columns of a Hive table or partition. The HiveQL in order to compute ...
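
As a rough illustration of the command described above, the following is a minimal sketch, not the book's own listing. The tables `customers` and `sales`, their columns, and the partition value are hypothetical names chosen for the example. It assumes a reasonably recent Hive release (0.14 or later for the `DESCRIBE FORMATTED` column form).

```sql
-- Hypothetical tables, used only for illustration.
CREATE TABLE IF NOT EXISTS customers (
  id  INT,
  age INT
);

CREATE TABLE IF NOT EXISTS sales (
  id     INT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING);

-- Make sure the hypothetical partition exists before analyzing it.
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (dt='2016-01-01');

-- Compute column statistics for two columns of an unpartitioned table.
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS id, age;

-- Compute column statistics for the same kind of column list,
-- restricted to a single partition.
ANALYZE TABLE sales PARTITION (dt='2016-01-01')
  COMPUTE STATISTICS FOR COLUMNS id, amount;

-- Inspect the statistics stored for one column of the unpartitioned table.
DESCRIBE FORMATTED customers age;
```

The set of statistics that `DESCRIBE FORMATTED` reports depends on the Hive version and the column type, so the output may not cover every item in the list above.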
