Appendix A. Built-in User Defined Functions and Piggybank

This appendix covers UDFs that come as part of the Pig distribution, including built-in UDFs and user-contributed UDFs in Piggybank.

Built-in UDFs

Pig comes prepackaged with many UDFs that can be used directly in Pig without using register or define. These include load, store, evaluation, and filter functions.

Built-in Load and Store Functions

Pig’s built-in load functions are listed in Table A-1; Table A-2 lists the store functions.

Table A-1. Load functions

Function	Location String indicates	Constructor arguments	Description
`HBaseStorage`	HBase table	The first argument is a string describing column family and column to Pig field mapping. The second is an option string (optional).	Load data from HBase (see HBase).
`PigStorage`	HDFS file	The first argument is a field separator (optional; defaults to Tab).	Load text data from HDFS (see Load).
`TextLoader`	HDFS file	None.	Reads lines of text, each line as a tuple with one chararray field.

Table A-2. Store functions

Function	Location String indicates	Constructor arguments	Description
`HBaseStorage`	HBase table	The first argument is a string describing Pig field to HBase column family and column mapping. The second is an option string (optional).	Store data to HBase (see HBase).
`PigStorage`	HDFS file	The first argument is a field separator (optional; defaults to Tab).	Store text to HDFS in text format (see Store).
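
As a minimal sketch of how the load and store functions in these tables are invoked, the following script reads a comma-separated file and writes it back out with the default separator. The file names and schema are hypothetical, chosen only for illustration.

```pig
-- Hypothetical input: a comma-separated file of (name, age) records.
users = load 'users.csv' using PigStorage(',') as (name:chararray, age:int);

-- TextLoader takes no constructor arguments; each line becomes one chararray field.
raw = load 'users.csv' using TextLoader() as (line:chararray);

-- With no constructor argument, PigStorage uses Tab as the field separator.
store users into 'users_out' using PigStorage();
```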

Built-in Evaluation and Filter Functions

The evaluation functions can be divided into math functions that mimic many of the Java math functions; aggregate functions that take a bag of values and produce a single result; functions that operate on or produce complex types; chararray and bytearray functions; filter functions; and miscellaneous functions.

Each of the built-in evaluation and filter functions is discussed in the following lists. In these lists, for brevity, a bag of tuples with a given type is specified by braces surrounding parentheses and a list of the tuples’ fields. For example, a bag of tuples with one integer field is denoted as {(int)}.

Built-in math UDFs

double ABS(double input)

Parameter:: input
Returns:: Absolute value
Since version:: 0.8

double ACOS(double input)

Parameter:: input
Returns:: Arc cosine
Since version:: 0.8

double ASIN(double input)

Parameter:: input
Returns:: Arc sine
Since version:: 0.8

double ATAN(double input)

Parameter:: input
Returns:: Arc tangent
Since version:: 0.8

double CBRT(double input)

Parameter:: input
Returns:: Cube root
Since version:: 0.8

double CEIL(double input)

Parameter:: input
Returns:: Next-highest double value that is a mathematical integer
Since version:: 0.8

double COS(double input)

Parameter:: input
Returns:: Cosine
Since version:: 0.8

double COSH(double input)

Parameter:: input
Returns:: Hyperbolic cosine
Since version:: 0.8

double EXP(double input)

Parameter:: input
Returns:: Euler’s number (e) raised to the power of input
Since version:: 0.8

double FLOOR(double input)

Parameter:: input
Returns:: Next-lowest double value that is a mathematical integer
Since version:: 0.8

double LOG(double input)

Parameter:: input
Returns:: Natural logarithm of input
Since version:: 0.8

double LOG10(double input)

Parameter:: input
Returns:: Logarithm base 10 of input
Since version:: 0.8

long ROUND(double input)

Parameter:: input
Returns:: Long nearest to the value of input
Since version:: 0.8

double SIN(double input)

Parameter:: input
Returns:: Sine
Since version:: 0.8

double SINH(double input)

Parameter:: input
Returns:: Hyperbolic sine
Since version:: 0.8

double SQRT(double input)

Parameter:: input
Returns:: Square root
Since version:: 0.8

double TAN(double input)

Parameter:: input
Returns:: Tangent
Since version:: 0.8

double TANH(double input)

Parameter:: input
Returns:: Hyperbolic tangent
Since version:: 0.8
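
A short sketch of the math functions in use may help here, in particular the difference in return type between CEIL and FLOOR (double) and ROUND (long). The input file and field name are assumptions made for this example.

```pig
-- Hypothetical input: one double value per line.
nums = load 'numbers.txt' as (x:double);

rounded = foreach nums generate
    x,
    CEIL(x)  as up,       -- double: 2.3 becomes 3.0
    FLOOR(x) as down,     -- double: 2.3 becomes 2.0
    ROUND(x) as nearest;  -- long:   2.3 becomes 2
```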

Built-in aggregate UDFs

int AVG({(int)} input)

Parameter:: input
Returns:: Average of all values in input; nulls are ignored
Since version:: 0.2

long AVG({(long)} input)

Parameter:: input
Returns:: Average of all values in input; nulls are ignored
Since version:: 0.2

float AVG({(float)} input)

Parameter:: input
Returns:: Average of all values in input; nulls are ignored
Since version:: 0.2

double AVG({(double)} input)

Parameter:: input
Returns:: Average of all values in input; nulls are ignored
Since version:: 0.2

double AVG({(bytearray)} input)

Parameter:: input
Returns:: Average of all bytearrays, cast to doubles, in input; nulls are ignored
Since version:: 0.1

long COUNT(bag input)

A version of COUNT that matches SQL semantics for COUNT(col)

Parameter:: input
Returns:: Number of records in input, excluding null values
Since version:: 0.1

long COUNT_STAR(bag input)

A version of COUNT that matches SQL semantics for COUNT(*)

Parameter:: input
Returns:: Number of all records in input, including null values
Since version:: 0.4

int MAX({(int)} input)

Parameter:: input
Returns:: Maximum value in input; nulls are ignored
Since version:: 0.2

long MAX({(long)} input)

Parameter:: input
Returns:: Maximum value in input; nulls are ignored
Since version:: 0.2

float MAX({(float)} input)

Parameter:: input
Returns:: Maximum value in input; nulls are ignored
Since version:: 0.2

double MAX({(double)} input)

Parameter:: input
Returns:: Maximum value in input; nulls are ignored
Since version:: 0.2

chararray MAX({(chararray)} input)

Parameter:: input
Returns:: Maximum value in input; nulls are ignored
Since version:: 0.2

double MAX({(bytearray)} input)

Parameter:: input
Returns:: Maximum of all bytearrays, cast to doubles, in input; nulls are ignored
Since version:: 0.1

int MIN({(int)} input)

Parameter:: input
Returns:: Minimum value in input; nulls are ignored
Since version:: 0.2

long MIN({(long)} input)

Parameter:: input
Returns:: Minimum value in input; nulls are ignored
Since version:: 0.2

float MIN({(float)} input)

Parameter:: input
Returns:: Minimum value in input; nulls are ignored
Since version:: 0.2

double MIN({(double)} input)

Parameter:: input
Returns:: Minimum value in input; nulls are ignored
Since version:: 0.2

chararray MIN({(chararray)} input)

Parameter:: input
Returns:: Minimum value in input; nulls are ignored
Since version:: 0.2

double MIN({(bytearray)} input)

Parameter:: input
Returns:: Minimum of all bytearrays, cast to doubles, in input; nulls are ignored
Since version:: 0.1

long SUM({(int)} input)

Parameter:: input
Returns:: Sum of all values in the bag; nulls are ignored
Since version:: 0.2

long SUM({(long)} input)

Parameter:: input
Returns:: Sum of all values in the bag; nulls are ignored
Since version:: 0.2

double SUM({(float)} input)

Parameter:: input
Returns:: Sum of all values in the bag; nulls are ignored
Since version:: 0.2

double SUM({(double)} input)

Parameter:: input
Returns:: Sum of all values in the bag; nulls are ignored
Since version:: 0.2

double SUM({(bytearray)} input)

Parameter:: input
Returns:: Sum of all bytearrays, cast to doubles, in input; nulls are ignored
Since version:: 0.1
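
Aggregate functions are normally applied to the bags produced by group. The sketch below, which assumes a hypothetical tab-separated file of players where bat_avg may be null, shows how COUNT and COUNT_STAR differ and how the other aggregates are called.

```pig
-- Hypothetical input: (name, team, bat_avg); bat_avg may be null.
players = load 'players.txt' as (name:chararray, team:chararray, bat_avg:double);
by_team = group players by team;

stats = foreach by_team generate
    group                  as team,
    COUNT_STAR(players)    as num_players,   -- counts every record
    COUNT(players.bat_avg) as num_with_avg,  -- skips records where bat_avg is null
    AVG(players.bat_avg)   as mean_avg,      -- nulls are ignored
    MAX(players.bat_avg)   as best_avg;      -- nulls are ignored
```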

Built-in chararray and bytearray UDFs

chararray CONCAT(chararray c1, chararray c2)

Parameters:

c1

c2

Returns:

Concatenation of c1 and c2

Since version:

0.1

bytearray CONCAT(bytearray b1, bytearray b2)

Parameters:

b1

b2

Returns:

Concatenation of b1 and b2

Since version:

0.1

int INDEXOF(chararray source, chararray search)

Parameters:

source: the chararray to search in

search: the chararray to search for

Returns:

Index of the first instance of search in source; -1 if search is not in source

Since version:

0.8

int LAST_INDEX_OF(chararray source, chararray search)

Parameters:

source: the chararray to search in

search: the chararray to search for

Returns:

Index of the last instance of search in source; -1 if search is not in source

Since version:

0.8

chararray LCFIRST(chararray input)

Parameter:: input
Returns:: input, with the first character converted to lowercase
Since version:: 0.8

chararray LOWER(chararray input)

Parameter:: input
Returns:: input with all characters converted to lowercase
Since version:: 0.8

chararray REGEX_EXTRACT(chararray source, chararray regex, int n)

Parameters:

source: the chararray to search in

regex: the regular expression to search for

n: take the nth match, counting from 0

Returns:

nth subset of the source matching regex; null if there are no matches

Since version:

0.8

(chararray) REGEX_EXTRACT_ALL(chararray source, chararray regex)

Parameters:

source: the chararray to search in

regex: the regular expression to search for

Returns:

Tuple containing all subsets of source matching regex; null if there are no matches

Since version:

0.8

chararray REPLACE(chararray source, chararray toReplace, chararray newValue)

Parameters:

source: the chararray to search in

toReplace: the chararray to be replaced

newValue: the new chararray to replace it with

Returns:

source with all instances of toReplace changed to newValue

Since version:

0.8

long SIZE(chararray input)

Parameter:: input
Returns:: Number of characters in input
Since version:: 0.2

long SIZE(bytearray input)

Parameter:: input
Returns:: Number of bytes in input
Since version:: 0.2

(chararray) STRSPLIT(chararray source)

Split a chararray by whitespace

Parameter:: source: the chararray to split
Returns:: Tuple with one field for each section of source
Since version:: 0.8

(chararray) STRSPLIT(chararray source, chararray regex)

Split a chararray by a regular expression

Parameters:

source: the chararray to split

regex: the regular expression to use as the delimiter

Returns:

Tuple with one field for each section of source

Since version:

0.8

(chararray) STRSPLIT(chararray source, chararray regex, int maxsplits)

Split a chararray by a regular expression

Parameters:

source: the chararray to split

regex: the regular expression to use as the delimiter

maxsplits: the maximum number of splits

Returns:

Tuple with one field for each section of source; if there are more than maxsplits sections, only the first maxsplits sections will be in the tuple

Since version:

0.8

chararray SUBSTRING(chararray source, int start, int end)

Parameters:

source: the chararray to take the substring from

start: the start position (inclusive), counting from 0

end: the end position (exclusive), counting from 0

Returns:

Subchararray; error if any input value has a length shorter than start

Since version:

0.8
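
Because start is inclusive and end is exclusive, a call with start 0 and end 3 returns the first three characters. A minimal sketch, assuming a hypothetical file of names:

```pig
-- Hypothetical input: one name per line.
names = load 'names.txt' as (name:chararray);

-- Takes characters 0, 1, and 2; for 'programming' this yields 'pro'.
prefixes = foreach names generate name, SUBSTRING(name, 0, 3) as prefix;
```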

{(chararray)} TOKENIZE(chararray input)

Parameter:: input: the chararray to split
Returns:: input split on whitespace, with each resulting value being placed in its own tuple and all tuples placed in the bag
Since version:: 0.1
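
Since TOKENIZE returns a bag, it is usually paired with flatten to get one record per word. The following word-count sketch illustrates the pattern; the input file name is an assumption.

```pig
-- Hypothetical input: free text, one line per record.
lines = load 'input.txt' as (line:chararray);

-- TOKENIZE yields a bag of one-field tuples; flatten produces one record per word.
words  = foreach lines generate flatten(TOKENIZE(line)) as word;
grpd   = group words by word;
counts = foreach grpd generate group as word, COUNT(words) as occurrences;
```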

chararray TRIM(chararray input)

Parameter:: input
Returns:: input with all leading and trailing whitespace removed
Since version:: 0.8

chararray UCFIRST(chararray input)

Parameter:: input
Returns:: input with the first character converted to uppercase
Since version:: 0.8

chararray UPPER(chararray input)

Parameter:: input
Returns:: input with all characters converted to uppercase
Since version:: 0.8

Built-in complex type UDFs

{(chararray, chararray, double)} COR({(double)} b1, {(double)} b2)

Calculate the correlation between two bags of doubles

Parameters:

b1

b2

Returns:

First chararray is the name of b1, second chararray is the name of b2, double is the correlation between b1 and b2

Since version:

0.8

{(chararray, chararray, double)} COV({(double)} b1, {(double)} b2)

Calculate the covariance of two bags of doubles

Parameters:

b1

b2

Returns:

First chararray is the name of b1, second chararray is the name of b2, double is the covariance of b1 and b2

Since version:

0.8

bag DIFF(bag b1, bag b2)

Parameters:

b1

b2

Returns:

All records from b1 that are not in b2, and all records from b2 that are not in b1

Since version:

0.1

long SIZE(map input)

Parameter:: input
Returns:: Number of key-value pairs in input
Since version:: 0.2

long SIZE(tuple input)

Parameter:: input
Returns:: Number of fields in input
Since version:: 0.2

long SIZE(bag input)

Parameter:: input
Returns:: Number of tuples in input
Since version:: 0.2

bag TOBAG(...)

Parameter:: Variable
Returns:: If all inputs have the same schema, the resulting bag will have that schema, else it will have a null schema; if the parameters are tuples, all schemas must have the same field names in addition to types
Since version:: 0.8

map TOMAP(...)

Parameter:: Variable
Returns:: Input parameters are paired up and placed in a map as key/value, key/value; all keys must be chararrays; an odd number of arguments will result in an error
Since version:: 0.9

bag TOP(int numRecords, int field, bag source)

Parameters:

numRecords: the number of records to return

field: the field to sort on

source: the bag to return records from

Returns:

A bag with numRecords records

Since version:

0.8

tuple TOTUPLE(...)

Parameter:: Variable
Returns:: A tuple with all of the fields passed in as arguments
Since version:: 0.8
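
The sketch below shows one way TOP, TOTUPLE, and TOMAP might be called. The file and schema are hypothetical, the field argument to TOP is assumed to be a 0-based position, and TOMAP requires version 0.9 or later as noted above.

```pig
-- Hypothetical input: (team, name, bat_avg).
players = load 'players.txt' as (team:chararray, name:chararray, bat_avg:double);
by_team = group players by team;

-- For each team, keep the top three records ranked on field 2 (bat_avg).
best = foreach by_team generate group as team, TOP(3, 2, players) as top_hitters;

-- Build a tuple and a map from scalar fields; TOMAP arguments alternate key, value.
packed = foreach players generate
    TOTUPLE(team, name)               as pair,
    TOMAP('name', name, 'team', team) as info;
```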

Built-in filter functions

boolean IsEmpty(bag input)

Parameter:: input
Returns:: true if input is empty, false otherwise
Since version:: 0.1

boolean IsEmpty(tuple input)

Parameter:: input
Returns:: true if input is empty, false otherwise
Since version:: 0.1
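
A common use of IsEmpty is to filter the output of cogroup for keys that are missing from one input. A minimal sketch, with hypothetical inputs:

```pig
-- Hypothetical inputs sharing an integer key.
a = load 'a.txt' as (id:int, val:chararray);
b = load 'b.txt' as (id:int, val:chararray);

both = cogroup a by id, b by id;

-- Keep only keys that have no matching records in b.
only_in_a = filter both by IsEmpty(b);
```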

Miscellaneous built-in UDF

double RANDOM()

Returns:: A random double between 0 and 1
Since version:: 0.4
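
One possible use of RANDOM is rough sampling inside a filter. In this sketch the input file and the 0.1 threshold are assumptions chosen for illustration.

```pig
-- Hypothetical input.
events = load 'events.txt' as (id:long, payload:chararray);

-- Keep each record with a probability of about 0.1.
sampled = filter events by RANDOM() < 0.1;
```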

Piggybank

Piggybank is Pig’s repository of user-contributed functions. Piggybank functions are distributed as part of the Pig distribution, but they are not built in. To use them, you must register the Piggybank JAR, which you can find in your distribution at contrib/piggybank/java/piggybank.jar.
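
As a sketch of what registering the JAR looks like, the script below registers Piggybank and defines a short alias for one of its functions. It assumes the script is run from the root of the Pig distribution and that the Reverse string function is present in your copy of Piggybank; the input file is hypothetical.

```pig
-- Path is relative to the root of the Pig distribution; adjust as needed.
register 'contrib/piggybank/java/piggybank.jar';

-- Give the fully qualified class name a short alias.
define reverse org.apache.pig.piggybank.evaluation.string.Reverse();

names    = load 'names.txt' as (name:chararray);
backward = foreach names generate reverse(name);
```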

At the time of writing, there is no central website or set of documentation for Piggybank. To find out what is in there, you will need to browse through the code. You can see all of the included functions by looking in your distribution under contrib/piggybank/. Piggybank does not yet include any Python functions, but it is set up to allow users to contribute functions in languages other than Java, so hopefully this will change in time.

Programming Pig by Alan Gates
