Analytics functions in Hive

Hive provides the following set of analytical functions:

  • RANK
  • DENSE_RANK
  • ROW_NUMBER
  • PERCENT_RANK
  • CUME_DIST
  • NTILE

Common and useful sets of analytical functions are ranking functions where rows from resultset are ranked according to a scheme.

How to do it…

Let's analyze each function in detail. We will be using the same sales dataset and applying analytical functions to it:

  • ROW_NUMBER: This function will provide a unique number to each row in resultset based on the ORDER BY clause within the PARTITION. For example, if we want to assign row_number to each fname, which is also partitioned by IP address in the sales dataset, the query would be:
    hive> select fname,ip,ROW_NUMBER() OVER (ORDER BY ip ) as rownum from sales;
    
  • RANK: It ...

Get Apache Hive Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.