It is time to turn our attention to how you can
extend Pig. So far we have focused on the operators and functions Pig
provides. But Pig also makes it easy for you to add your own processing
logic via user-defined functions (UDFs). UDFs can be written in Java or in
scripting languages. This chapter will walk through how you can build
evaluation functions, which are UDFs that operate on single
elements of data or collections of data. It will also cover how to write
filter functions, which are UDFs that can be used as part of filter statements.
UDFs are powerful tools, and thus the interfaces are somewhat complex. In designing Pig, a central goal was to make easy things easy and hard things possible. So, the simplest UDFs can be implemented in a single method, but you will have to implement a few more methods to take advantage of more advanced features. We will cover both cases in this chapter.
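To make the "single method" case concrete, here is a minimal sketch of an evaluation function that upper-cases its first argument. So that it compiles without Pig on the classpath, the nested EvalFunc and Tuple types are simplified stand-ins for org.apache.pig.EvalFunc and org.apache.pig.data.Tuple. They model only the methods used here. The names UpperSketch, Upper, and tupleOf are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class UpperSketch {

    /** Stand-in (assumption) for org.apache.pig.data.Tuple: only what this sketch needs. */
    interface Tuple {
        int size();
        Object get(int fieldNum) throws IOException;
    }

    /** Stand-in (assumption) for org.apache.pig.EvalFunc<T>: exec is the one method to implement. */
    abstract static class EvalFunc<T> {
        public abstract T exec(Tuple input) throws IOException;
    }

    /** The UDF itself (hypothetical name): returns its first argument in upper case. */
    static class Upper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            // Return null rather than failing when there is nothing to process.
            if (input == null || input.size() == 0) {
                return null;
            }
            Object value = input.get(0);
            // A null field produces a null result.
            return value == null ? null : value.toString().toUpperCase(Locale.ROOT);
        }
    }

    /** Hypothetical helper for trying the UDF outside Pig: wraps values in a stand-in Tuple. */
    static Tuple tupleOf(Object... fields) {
        List<Object> values = Arrays.asList(fields);
        return new Tuple() {
            @Override
            public int size() { return values.size(); }

            @Override
            public Object get(int fieldNum) { return values.get(fieldNum); }
        };
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new Upper().exec(tupleOf("hello pig"))); // prints HELLO PIG
    }
}
```

In a real UDF you would delete the two stand-in types and import the Pig classes instead. The body of exec would stay the same.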
Throughout this chapter we will use several running examples of UDFs. Some of these are built-in Pig UDFs, which can be found in your Pig distribution at src/org/apache/pig/builtin/. The rest can be found on GitHub with the other example UDFs, in the udfs directory.
Pig and Hadoop are implemented in Java, and so it is natural to implement UDFs in Java. This allows UDFs access to the Hadoop APIs and to many of Pig’s facilities.
Before diving into the details, it is worth considering names. Pig locates a UDF by looking ...