A. Built-in User Defined Functions and Piggybank by Alan Gates

Appendix A. Built-in User Defined Functions and Piggybank

This appendix covers UDFs that come as part of the Pig distribution, including built-in UDFs and user-contributed UDFs in Piggybank.

Built-in UDFs

Pig comes prepackaged with many UDFs that can be used directly in Pig without using register or define. These include load, store, evaluation, and filter functions.

Built-in Load and Store Functions

Pig’s built-in load functions are listed in Table A-1; Table A-2 lists the store functions.

Table A-1. Load functions

| Function | Location string indicates | Constructor arguments | Description |
|---|---|---|---|
| HBaseStorage | HBase table | The first argument is a string describing column family and column to Pig field mapping. The second is an option string (optional). | Load data from HBase (see HBase). |
| PigStorage | HDFS file | The first argument is a field separator (optional; defaults to Tab). | Load text data from HDFS (see Load). |
| TextLoader | HDFS file | None. | Reads lines of text, each line as a tuple with one chararray field. |

Table A-2. Store functions

| Function | Location string indicates | Constructor arguments | Description |
|---|---|---|---|
| HBaseStorage | HBase table | The first argument is a string describing Pig field to HBase column family and column mapping. The second is an option string (optional). | Store data to HBase (see HBase). |
| PigStorage | HDFS file | The first argument is a field separator (optional; defaults to Tab). | Store text to HDFS in text format (see Store). |

Built-in Evaluation and Filter Functions

The evaluation functions can be divided into math functions that mimic many of the Java math functions; aggregate functions that take a bag of values and produce a single result; functions that operate on or produce complex types; chararray and bytearray functions; filter functions; and miscellaneous functions.

Each of the built-in evaluation and filter functions is discussed in the following lists. In these lists, for brevity, a bag of tuples with a given type is specified by braces surrounding parentheses and a list of the tuples’ fields. For example, a bag of tuples with one integer field is denoted as {(int)}.

Built-in math UDFs

double ABS(double input)
Parameter:

input

Returns:

Absolute value

Since version:

0.8

double ACOS(double input)
Parameter:

input

Returns:

Arc cosine

Since version:

0.8

double ASIN(double input)
Parameter:

input

Returns:

Arc sine

Since version:

0.8

double ATAN(double input)
Parameter:

input

Returns:

Arc tangent

Since version:

0.8

double CBRT(double input)
Parameter:

input

Returns:

Cube root

Since version:

0.8

double CEIL(double input)
Parameter:

input

Returns:

Next-highest double value that is a mathematical integer

Since version:

0.8

double COS(double input)
Parameter:

input

Returns:

Cosine

Since version:

0.8

double COSH(double input)
Parameter:

input

Returns:

Hyperbolic cosine

Since version:

0.8

double EXP(double input)
Parameter:

input

Returns:

Euler’s number (e) raised to the power of input

Since version:

0.8

double FLOOR(double input)
Parameter:

input

Returns:

Next-lowest double value that is a mathematical integer

Since version:

0.8

double LOG(double input)
Parameter:

input

Returns:

Natural logarithm of input

Since version:

0.8

double LOG10(double input)
Parameter:

input

Returns:

Logarithm base 10 of input

Since version:

0.8

long ROUND(double input)
Parameter:

input

Returns:

Long nearest to the value of input

Since version:

0.8

double SIN(double input)
Parameter:

input

Returns:

Sine

Since version:

0.8

double SINH(double input)
Parameter:

input

Returns:

Hyperbolic sine

Since version:

0.8

double SQRT(double input)
Parameter:

input

Returns:

Square root

Since version:

0.8

double TAN(double input)
Parameter:

input

Returns:

Tangent

Since version:

0.8

double TANH(double input)
Parameter:

input

Returns:

Hyperbolic tangent

Since version:

0.8
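
The introduction to this section says the math UDFs mimic the Java math functions. As a rough illustration of the three rounding-style entries above (CEIL, FLOOR, and ROUND), the following Python sketch models their described return values. The helper names `pig_ceil`, `pig_floor`, and `pig_round` are hypothetical, and treating ROUND as round-half-up (the way Java's `Math.round` behaves) is an assumption rather than something the entries state.

```python
import math


def pig_ceil(value: float) -> float:
    # CEIL is described as returning a double that is a mathematical
    # integer, so the result is kept as a float.
    return float(math.ceil(value))


def pig_floor(value: float) -> float:
    # FLOOR likewise returns a double, not an integer type.
    return float(math.floor(value))


def pig_round(value: float) -> int:
    # Assumption: ties round toward positive infinity, as in Java's
    # Math.round. Python's built-in round() uses a different tie rule.
    return math.floor(value + 0.5)


print(pig_ceil(2.1), pig_floor(2.9), pig_round(2.5), pig_round(-2.5))
```

Note that only ROUND changes the type of its result (to a long); CEIL and FLOOR stay in the double domain.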

Built-in aggregate UDFs

int AVG({(int)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

long AVG({(long)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

float AVG({(float)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

double AVG({(double)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

double AVG({(bytearray)} input)
Parameter:

input

Returns:

Average of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1

long COUNT(bag input)

A version of COUNT that matches SQL semantics for COUNT(col)

Parameter:

input

Returns:

Number of records in input, excluding null values

Since version:

0.1

long COUNT_STAR(bag input)

A version of COUNT that matches SQL semantics for COUNT(*)

Parameter:

input

Returns:

Number of all records in input, including null values

Since version:

0.4
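
The difference between COUNT and COUNT_STAR is easiest to see on a bag that contains a null. The Python sketch below models a bag as a list of tuples and a null as `None`. The function names `count` and `count_star` are hypothetical, and the sketch assumes that "null value" means a record whose first field is null, which the entries above do not spell out.

```python
from typing import List, Optional, Tuple

Record = Tuple[Optional[int], ...]


def count(bag: List[Record]) -> int:
    # Assumption: a record is skipped when its first field is None.
    return sum(1 for record in bag if record and record[0] is not None)


def count_star(bag: List[Record]) -> int:
    # Every record is counted, whether or not it holds a null.
    return len(bag)


bag = [(1,), (None,), (3,)]
print(count(bag), count_star(bag))  # 2 3
```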

int MAX({(int)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

long MAX({(long)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

float MAX({(float)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

double MAX({(double)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

chararray MAX({(chararray)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

double MAX({(bytearray)} input)
Parameter:

input

Returns:

Maximum of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1

int MIN({(int)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

long MIN({(long)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

float MIN({(float)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

double MIN({(double)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

chararray MIN({(chararray)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

double MIN({(bytearray)} input)
Parameter:

input

Returns:

Minimum of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1

long SUM({(int)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

long SUM({(long)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

double SUM({(float)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

double SUM({(double)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

double SUM({(bytearray)} input)
Parameter:

input

Returns:

Sum of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1
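
Every aggregate entry above notes that nulls are ignored. A minimal Python sketch of that rule, for a bag of single-field tuples written `{(double)}` in the notation of this appendix, might look as follows. The helper names are hypothetical, and returning `None` when the bag holds no non-null values is an assumption, since the entries do not say what happens in that case.

```python
from typing import List, Optional, Tuple

Bag = List[Tuple[Optional[float]]]


def non_null(bag: Bag) -> List[float]:
    # Drop records whose single field is null before aggregating.
    return [record[0] for record in bag if record[0] is not None]


def pig_sum(bag: Bag) -> Optional[float]:
    values = non_null(bag)
    # Assumption: no non-null values yields a null result.
    return sum(values) if values else None


def pig_avg(bag: Bag) -> Optional[float]:
    values = non_null(bag)
    # The divisor is the number of non-null values, not the bag size.
    return sum(values) / len(values) if values else None


def pig_max(bag: Bag) -> Optional[float]:
    values = non_null(bag)
    return max(values) if values else None


bag = [(1.0,), (None,), (4.0,)]
print(pig_sum(bag), pig_avg(bag), pig_max(bag))  # 5.0 2.5 4.0
```

The point to notice is in `pig_avg`: because the null is ignored, the average is taken over two values, not three.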

Built-in chararray and bytearray UDFs

chararray CONCAT(chararray c1, chararray c2)
Parameters:

c1

c2

Returns:

Concatenation of c1 and c2

Since version:

0.1

bytearray CONCAT(bytearray b1, bytearray b2)
Parameters:

b1

b2

Returns:

Concatenation of b1 and b2

Since version:

0.1

int INDEXOF(chararray source, chararray search)
Parameters:

source: the chararray to search in

search: the chararray to search for

Returns:

Index of the first instance of search in source; -1 if search is not in source

Since version:

0.8

int LAST_INDEX_OF(chararray source, chararray search)
Parameters:

source: the chararray to search in

search: the chararray to search for

Returns:

Index of the last instance of search in source; -1 if search is not in source

Since version:

0.8
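
The return conventions of INDEXOF and LAST_INDEX_OF (a position counted from 0, or -1 when the search string is absent) happen to match Python's `str.find` and `str.rfind`, so they can be sketched directly. The function names `index_of` and `last_index_of` are hypothetical stand-ins for the two UDFs.

```python
def index_of(source: str, search: str) -> int:
    # Position of the first occurrence of search, or -1 if not found.
    return source.find(search)


def last_index_of(source: str, search: str) -> int:
    # Position of the last occurrence of search, or -1 if not found.
    return source.rfind(search)


print(index_of("hello world", "o"))       # 4
print(last_index_of("hello world", "o"))  # 7
print(index_of("hello world", "z"))       # -1
```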

chararray LCFIRST(chararray input)
Parameter:

input

Returns:

input, with the first character converted to lowercase

Since version:

0.8

chararray LOWER(chararray input)
Parameter:

input

Returns:

input with all characters converted to lowercase

Since version:

0.8

chararray REGEX_EXTRACT(chararray source, chararray regex, int n)
Parameters:

source: the chararray to search in

regex: the regular expression to search for

n: take the nth match, counting from 0

Returns:

nth subset of the source matching regex; null if there are no matches

Since version:

0.8

(chararray) REGEX_EXTRACT_ALL(chararray source, chararray regex)
Parameters:

source: the chararray to search in

regex: the regular expression to search for

Returns:

Tuple containing all subsets of source matching regex; null if there are no matches

Since version:

0.8

chararray REPLACE(chararray source, chararray toReplace, chararray newValue)
Parameters:

source: the chararray to search in

toReplace: the chararray to be replaced

newValue: the new chararray to replace it with

Returns:

source with all instances of toReplace changed to newValue

Since version:

0.8

long SIZE(chararray input)
Parameter:

input

Returns:

Number of characters in input

Since version:

0.2

long SIZE(bytearray input)
Parameter:

input

Returns:

Number of bytes in input

Since version:

0.2

(chararray) STRSPLIT(chararray source)

Split a chararray by whitespace

Parameter:

source: the chararray to split

Returns:

Tuple with one field for each section of source

Since version:

0.8

(chararray) STRSPLIT(chararray source, chararray regex)

Split a chararray by a regular expression

Parameters:

source: the chararray to split

regex: the regular expression to use as the delimiter

Returns:

Tuple with one field for each section of source

Since version:

0.8

(chararray) STRSPLIT(chararray source, chararray regex, int maxsplits)

Split a chararray by a regular expression

Parameters:

source: the chararray to split

regex: the regular expression to use as the delimiter

maxsplits: the maximum number of splits

Returns:

Tuple with one field for each section of source; if there are more than maxsplits sections, only the first maxsplits sections will be in the tuple

Since version:

0.8

chararray SUBSTRING(chararray source, int start, int end)
Parameters:

source: the chararray to take the substring from

start: the start position (inclusive), counting from 0

end: the end position (exclusive), counting from 0

Returns:

Substring of source from start up to, but not including, end; it is an error if source is shorter than start

Since version:

0.8
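
The inclusive start and exclusive end of SUBSTRING follow the same convention as Python slicing, which makes a small sketch possible. The function name `substring` is hypothetical, and raising `ValueError` is only this sketch's way of representing the error case described above; how the real UDF reports the error is not covered here.

```python
def substring(source: str, start: int, end: int) -> str:
    # The entry above says a source shorter than start is an error.
    if len(source) < start:
        raise ValueError("source is shorter than start")
    # start is inclusive and end is exclusive, both counted from 0.
    return source[start:end]


print(substring("piglatin", 3, 8))  # latin
print(substring("pig", 0, 2))       # pi
```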

{(chararray)} TOKENIZE(chararray input)
Parameter:

input: the chararray to split

Returns:

input split on whitespace, with each resulting value being placed in its own tuple and all tuples placed in the bag

Since version:

0.1
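
The shape of the TOKENIZE result, `{(chararray)}`, is a bag of one-field tuples rather than a flat list of words. The Python sketch below models that shape with a list of 1-tuples. The function name `tokenize` is hypothetical, and using Python's whitespace rules for the split is an assumption.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[Tuple[str]]:
    # Each whitespace-separated token is wrapped in its own tuple,
    # and the tuples are collected into one bag (modeled as a list).
    return [(token,) for token in text.split()]


print(tokenize("a  b\tc"))  # [('a',), ('b',), ('c',)]
```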

chararray TRIM(chararray input)
Parameter:

input

Returns:

input with all leading and trailing whitespace removed

Since version:

0.8

chararray UCFIRST(chararray input)
Parameter:

input

Returns:

input with the first character converted to uppercase

Since version:

0.8

chararray UPPER(chararray input)
Parameter:

input

Returns:

input with all characters converted to uppercase

Since version:

0.8

Built-in complex type UDFs

{(chararray, chararray, double)} COR({(double)} b1, {(double)} b2)

Calculate the correlation between two bags of doubles

Parameters:

b1

b2

Returns:

First chararray is the name of b1, second chararray is the name of b2, double is the correlation between b1 and b2

Since version:

0.8

{(chararray, chararray, double)} COV({(double)} b1, {(double)} b2)

Calculate the covariance of two bags of doubles

Parameters:

b1

b2

Returns:

First chararray is the name of b1, second chararray is the name of b2, double is the covariance of b1 and b2

Since version:

0.8

bag DIFF(bag b1, bag b2)
Parameters:

b1

b2

Returns:

All records from b1 that are not in b2, and all records from b2 that are not in b1

Since version:

0.1
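
DIFF as described above returns records from both sides, not only the records missing from one of them. A minimal Python sketch of that reading, with bags modeled as lists of tuples, is shown below. The function name `diff` is hypothetical, and the order of the output is incidental, since bags are unordered.

```python
from typing import List, Tuple


def diff(b1: List[Tuple], b2: List[Tuple]) -> List[Tuple]:
    # Records found in only one of the two bags, taken from both sides.
    only_in_b1 = [record for record in b1 if record not in b2]
    only_in_b2 = [record for record in b2 if record not in b1]
    return only_in_b1 + only_in_b2


print(diff([(1,), (2,)], [(2,), (3,)]))  # [(1,), (3,)]
```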

long SIZE(map input)
Parameter:

input

Returns:

Number of key-value pairs in input

Since version:

0.2

long SIZE(tuple input)
Parameter:

input

Returns:

Number of fields in input

Since version:

0.2

long SIZE(bag input)
Parameter:

input

Returns:

Number of tuples in input

Since version:

0.2

bag TOBAG(...)
Parameter:

Variable

Returns:

If all inputs have the same schema, the resulting bag will have that schema, else it will have a null schema; if the parameters are tuples, all schemas must have the same field names in addition to types

Since version:

0.8

map TOMAP(...)
Parameter:

Variable

Returns:

Input parameters are paired up and placed in a map as key/value, key/value; all keys must be chararrays; an odd number of arguments will result in an error

Since version:

0.9
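
The pairing rule for TOMAP can be sketched in Python as follows. The function name `to_map` is hypothetical, and the specific exception types are this sketch's choice; the entry above says only that an odd number of arguments is an error and that keys must be chararrays.

```python
from typing import Any, Dict


def to_map(*args: Any) -> Dict[str, Any]:
    # Arguments are read as key, value, key, value, and so on.
    if len(args) % 2 != 0:
        raise ValueError("TOMAP needs an even number of arguments")
    keys = args[0::2]
    values = args[1::2]
    for key in keys:
        # Keys must be chararrays, modeled here as Python strings.
        if not isinstance(key, str):
            raise TypeError("map keys must be chararrays")
    return dict(zip(keys, values))


print(to_map("name", "alice", "age", 30))  # {'name': 'alice', 'age': 30}
```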

bag TOP(int numRecords, int field, bag source)
Parameters:

numRecords: the number of records to return

field: the field to sort on

source: the bag to return records from

Returns:

A bag containing the top numRecords records from source, as ordered by field

Since version:

0.8
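
One way to picture TOP is as a bounded "largest N" selection over a bag. The Python sketch below uses `heapq.nlargest` for that. The function name `top` is hypothetical, and two assumptions are made that the entry above does not state: that `field` is a position counted from 0, and that "top" means the records with the largest values in that field.

```python
import heapq
from typing import List, Tuple


def top(num_records: int, field: int, source: List[Tuple]) -> List[Tuple]:
    # Assumption: field is a 0-based position and larger values rank higher.
    return heapq.nlargest(num_records, source, key=lambda record: record[field])


scores = [("a", 3), ("b", 9), ("c", 5)]
print(top(2, 1, scores))  # [('b', 9), ('c', 5)]
```

Selecting the top records this way avoids sorting the whole bag when only a few records are wanted.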

tuple TOTUPLE(...)
Parameter:

Variable

Returns:

A tuple with all of the fields passed in as arguments

Since version:

0.8

Built-in filter functions

boolean IsEmpty(bag input)
Parameter:

input

Returns:

True if input is empty, false otherwise

Since version:

0.1

boolean IsEmpty(tuple input)
Parameter:

input

Returns:

True if input is empty, false otherwise

Since version:

0.1

Miscellaneous built-in UDF

double RANDOM()
Returns:

A random double between 0 and 1

Since version:

0.4
