
In our initial blog post, we learned how to use pre-existing UDFs in Hive queries. Oftentimes, however, you need custom logic in your Hive queries that isn't covered by any existing UDF. For such cases, Hive offers a pluggable UDF interface, which makes it easy for users to create their own UDFs. UDFs are written in Java and, once compiled and registered, can be used just like any pre-existing UDF in Hive queries.

Here is an example of a UDF that I recently wrote, which you may use as a reference.
First off, your UDF will need to inherit from the GenericUDF class. Your UDF class can be annotated with two annotations:

  • @UDFType(deterministic = true/false) states whether your UDF is deterministic or not. This is set to true by default. Deterministic functions always return the same result when called with the same set of arguments. For example, avg() is a deterministic function because the result is always the same for the same input. However, unix_timestamp() (without any arguments) is not deterministic because it returns the current time as a Unix timestamp using the default timezone; the result therefore depends on when the function is called.
  • @Description(name="my_udf", value="output of describe command", extended="output of describe extended command") sets the name of the UDF and the text displayed when DESCRIBE FUNCTION my_udf or DESCRIBE FUNCTION EXTENDED my_udf is issued. These commands print out a description of the UDF. A sketch showing both annotations in place follows this list.
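For illustration, here is a minimal sketch of both annotations applied to a UDF class (the name my_udf, the class name, and the description strings are placeholders; _FUNC_ is expanded by Hive to the registered function name):

@UDFType(deterministic = true)
@Description(name = "my_udf",
    value = "_FUNC_(arg) - short description shown by DESCRIBE FUNCTION",
    extended = "Longer description shown by DESCRIBE FUNCTION EXTENDED")
public class MyUDF extends GenericUDF {
  // The three abstract methods, initialize(), evaluate(), and
  // getDisplayString(), are omitted here; they are covered below.
}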

Secondly, there are three methods you will need to implement in your UDF class. These methods override the abstract methods with the same name in the GenericUDF class. These methods are:

public ObjectInspector initialize(ObjectInspector[] arguments)
This method is called only once during the lifetime of the UDF. Its goal is to check the validity (number, type, etc.) of the arguments being passed to the UDF, and it also sets the return type of the UDF's result. You may throw a UDFArgumentLengthException if the wrong number of arguments is passed to the UDF, or a UDFArgumentTypeException if an argument has the wrong type.
public Object evaluate(DeferredObject[] arguments)
This is the method where most of your custom logic goes. You will need to extract the arguments from the DeferredObject array that is passed in and then apply your custom logic to return a scalar value (more on returning non-scalar values later).
public String getDisplayString(String[] children)
This method returns the string to display when an EXPLAIN or EXPLAIN EXTENDED command is issued on a query containing the UDF.
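To tie these together, here is a minimal sketch of a complete UDF that upper-cases a string argument. The package, class name, function name (to_upper), and logic are hypothetical stand-ins for your own; the Hive classes and methods are the ones discussed above:

package com.example.hive.udf; // hypothetical package

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

@UDFType(deterministic = true)
@Description(name = "to_upper",
    value = "_FUNC_(str) - returns str in upper case",
    extended = "Example:\n  > SELECT to_upper(name) FROM my_table;")
public class ToUpper extends GenericUDF {

  private StringObjectInspector inputOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // Validate the number and type of arguments.
    if (arguments.length != 1) {
      throw new UDFArgumentLengthException("to_upper() takes exactly one argument");
    }
    if (!(arguments[0] instanceof StringObjectInspector)) {
      throw new UDFArgumentTypeException(0, "to_upper() takes a string argument");
    }
    inputOI = (StringObjectInspector) arguments[0];
    // Declare that this UDF returns a string.
    return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // Extract the argument, passing SQL NULLs through.
    Object arg = arguments[0].get();
    if (arg == null) {
      return null;
    }
    String value = inputOI.getPrimitiveJavaObject(arg);
    return new Text(value.toUpperCase());
  }

  @Override
  public String getDisplayString(String[] children) {
    // Shown in EXPLAIN / EXPLAIN EXTENDED output.
    return "to_upper(" + children[0] + ")";
  }
}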

Once you have implemented the above three methods, you will have to compile your UDF into a JAR. To compile your UDF, you will need to include hive-serde.jar and hive-exec.jar in your classpath. Both of these JARs are available under $HIVE_HOME/lib/, where $HIVE_HOME refers to the home directory of your Hive installation (e.g. /usr/lib/hive). Depending on which writables you are using, you may also need to include hadoop-core.jar, which is available under $HADOOP_HOME/lib, where $HADOOP_HOME refers to the home directory of your Hadoop installation (e.g. /usr/lib/hadoop).
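As a sketch, compiling and packaging the hypothetical ToUpper class above might look like the following (the JARs in your installation may carry version suffixes, e.g. hive-exec-0.9.0.jar; adjust the paths accordingly):

javac -cp /usr/lib/hive/lib/hive-serde.jar:/usr/lib/hive/lib/hive-exec.jar:/usr/lib/hadoop/lib/hadoop-core.jar -d build ToUpper.java
jar cf to-upper.jar -C build .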

Once you have compiled your UDF, you will need to register it with Hive before using it. You can do so by running the following in the Hive CLI:
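-- a sketch, assuming the hypothetical JAR and class from the example above
ADD JAR /path/to/to-upper.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';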

You will then be able to use your newly created UDF like any other pre-existing Hive UDF. Keep in mind that you will have to run the above commands each time you start a new Hive client session. If you would like your UDF to be available by default, you will have to add it to the function registry and recompile Hive.
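For example, using the hypothetical to_upper function registered above (the table and column names are placeholders):

SELECT to_upper(name) FROM my_table;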

Note: For a query like select my_udf(col) from my_table, there are a couple of things users should be aware of:

  • UDFs operate on a single-record scope: each call to a UDF processes one record at a time.
  • UDFs are called during the map phase of the MapReduce job, so the user has no control over the order in which records are sent to the UDF. Records arrive in the order the mappers read them from their file splits, and since that order may change, it is not recommended to implicitly rely on it.

Given the above context, you should not do any "aggregation" in a UDF. If you would like to write a custom function that aggregates across multiple rows, you need to write a User Defined Aggregate Function (UDAF). If you would like to return multiple output records for every single input record, you should implement a User Defined Table Generating Function (UDTF).

Another way to plug custom logic into Hive queries is to use Hive's transform functionality. You can learn more about it on the Hive wiki.

Safari Books Online has the content you need

Below are some Hive books to help you develop applications, or you can check out all of the Hive books and training videos available from Safari Books Online. You can browse the content in preview mode or you can gain access to more information with a free trial or subscription to Safari Books Online.

Programming Hive introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem.
If your organization is looking for a storage solution to accommodate a virtually endless amount of data, this book will show you how Apache HBase can fulfill your needs. As the open source implementation of Google's BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. HBase: The Definitive Guide provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
Ready to unlock the power of your data? With Hadoop: The Definitive Guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You will also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Start your FREE 10-day trial to Safari Books Online

About this author

Mark Grover is a contributor to the Apache Hive project and an active respondent on Hive's mailing list and IRC channel. He is a section author of O'Reilly's book on Hive, Programming Hive. He works as a Software Developer at Cloudera and is also a contributor to the Apache Bigtop project.

Tags: Big Data, databases, Hadoop, HBase, Hive, queries, UDFs
