Programming Pig

By Alan Gates. Published by O'Reilly Media, Inc.

Appendix A. Built-in User Defined Functions and Piggybank

This appendix covers UDFs that come as part of the Pig distribution, including built-in UDFs and user-contributed UDFs in Piggybank.

Built-in UDFs

Pig comes prepackaged with many UDFs that can be used directly in Pig without using register or define. These include load, store, evaluation, and filter functions.

Built-in Load and Store Functions

Pig’s built-in load functions are listed in Table A-1; Table A-2 lists the store functions.

Table A-1. Load functions

Function | Location string indicates | Constructor arguments | Description
HBaseStorage | HBase table | The first argument is a string describing the column family and column to Pig field mapping. The second is an option string (optional). | Load data from HBase (see HBase).
PigStorage | HDFS file | The first argument is a field separator (optional; defaults to tab). | Load text data from HDFS (see Load).
TextLoader | HDFS file | None. | Reads lines of text, each line as a tuple with one chararray field.
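
To make the difference between PigStorage and TextLoader concrete, here is a minimal Python sketch of how each turns lines of text into tuples. It only models the behavior described in Table A-1; the function names pig_storage_load and text_loader are hypothetical, and fields are shown as plain strings rather than Pig types.

```python
def pig_storage_load(lines, separator="\t"):
    """Model of PigStorage: split each line into fields on the separator.

    The separator defaults to tab, as Table A-1 states.
    """
    return [tuple(line.rstrip("\n").split(separator)) for line in lines]


def text_loader(lines):
    """Model of TextLoader: each line becomes a tuple with one field."""
    return [(line.rstrip("\n"),) for line in lines]


lines = ["alice\t27\n", "bob\t31\n"]

tab_rows = pig_storage_load(lines)                            # default separator
comma_rows = pig_storage_load(["alice,27\n"], separator=",")  # explicit separator
line_rows = text_loader(lines)                                # whole line, one field
```

With the default separator each line yields a two-field tuple, while the TextLoader model keeps the tab inside a single field.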

Table A-2. Store functions

Function | Location string indicates | Constructor arguments | Description
HBaseStorage | HBase table | The first argument is a string describing the Pig field to HBase column family and column mapping. The second is an option string (optional). | Store data to HBase (see HBase).
PigStorage | HDFS file | The first argument is a field separator (optional; defaults to tab). | Store text to HDFS in text format (see Store).

Built-in Evaluation and Filter Functions

The built-in evaluation and filter functions can be divided into math functions, which mimic many of the Java math functions; aggregate functions, which take a bag of values and produce a single result; functions that operate on or produce complex types; chararray and bytearray functions; filter functions; and miscellaneous functions.

Each of the built-in evaluation and filter functions is discussed in the following lists. In these lists, for brevity, a bag of tuples with a given type is specified by braces surrounding parentheses and a list of the tuples’ fields. For example, a bag of tuples with one integer field is denoted as {(int)}.

Built-in math UDFs

double ABS(double input)
Parameter:

input

Returns:

Absolute value

Since version:

0.8

double ACOS(double input)
Parameter:

input

Returns:

Arc cosine

Since version:

0.8

double ASIN(double input)
Parameter:

input

Returns:

Arc sine

Since version:

0.8

double ATAN(double input)
Parameter:

input

Returns:

Arc tangent

Since version:

0.8

double CBRT(double input)
Parameter:

input

Returns:

Cube root

Since version:

0.8

double CEIL(double input)
Parameter:

input

Returns:

Smallest double value that is greater than or equal to input and is a mathematical integer

Since version:

0.8

double COS(double input)
Parameter:

input

Returns:

Cosine

Since version:

0.8

double COSH(double input)
Parameter:

input

Returns:

Hyperbolic cosine

Since version:

0.8

double EXP(double input)
Parameter:

input

Returns:

Euler’s number (e) raised to the power of input

Since version:

0.8

double FLOOR(double input)
Parameter:

input

Returns:

Largest double value that is less than or equal to input and is a mathematical integer

Since version:

0.8

double LOG(double input)
Parameter:

input

Returns:

Natural logarithm of input

Since version:

0.8

double LOG10(double input)
Parameter:

input

Returns:

Logarithm base 10 of input

Since version:

0.8

long ROUND(double input)
Parameter:

input

Returns:

Long nearest to the value of input

Since version:

0.8
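
The rounding functions are easy to confuse, so here is a small Python sketch of CEIL, FLOOR, and ROUND. The helper names are hypothetical. Because this appendix says the math UDFs mimic the Java math functions, the sketch assumes ROUND behaves like Java's Math.round and rounds ties toward positive infinity; that tie rule is an assumption, not something stated above.

```python
import math


def pig_ceil(x):
    """Model of CEIL: returns a double, not an integer type."""
    return float(math.ceil(x))


def pig_floor(x):
    """Model of FLOOR: returns a double, not an integer type."""
    return float(math.floor(x))


def pig_round(x):
    """Model of ROUND: returns a whole number (a long in Pig).

    Assumption: ties round toward positive infinity, as Java's
    Math.round does. Python's built-in round() uses a different
    tie rule, so it is deliberately not used here.
    """
    return math.floor(x + 0.5)


ceil_result = pig_ceil(2.1)
floor_result = pig_floor(-2.1)
round_results = [pig_round(2.4), pig_round(2.5), pig_round(-2.5)]
```

Note that CEIL and FLOOR return the input unchanged when it is already a mathematical integer.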

double SIN(double input)
Parameter:

input

Returns:

Sine

Since version:

0.8

double SINH(double input)
Parameter:

input

Returns:

Hyperbolic sine

Since version:

0.8

double SQRT(double input)
Parameter:

input

Returns:

Square root

Since version:

0.8

double TAN(double input)
Parameter:

input

Returns:

Tangent

Since version:

0.8

double TANH(double input)
Parameter:

input

Returns:

Hyperbolic tangent

Since version:

0.8

Built-in aggregate UDFs

double AVG({(int)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

double AVG({(long)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

double AVG({(float)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

double AVG({(double)} input)
Parameter:

input

Returns:

Average of all values in input; nulls are ignored

Since version:

0.2

double AVG({(bytearray)} input)
Parameter:

input

Returns:

Average of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1

long COUNT(bag input)

A version of COUNT that matches SQL semantics for COUNT(col)

Parameter:

input

Returns:

Number of records in input, excluding null values

Since version:

0.1

long COUNT_STAR(bag input)

A version of COUNT that matches SQL semantics for COUNT(*)

Parameter:

input

Returns:

Number of all records in input, including null values

Since version:

0.4

int MAX({(int)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

long MAX({(long)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

float MAX({(float)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

double MAX({(double)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

chararray MAX({(chararray)} input)
Parameter:

input

Returns:

Maximum value in input; nulls are ignored

Since version:

0.2

double MAX({(bytearray)} input)
Parameter:

input

Returns:

Maximum of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1

int MIN({(int)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

long MIN({(long)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

float MIN({(float)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

double MIN({(double)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

chararray MIN({(chararray)} input)
Parameter:

input

Returns:

Minimum value in input; nulls are ignored

Since version:

0.2

double MIN({(bytearray)} input)
Parameter:

input

Returns:

Minimum of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1

long SUM({(int)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

long SUM({(long)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

double SUM({(float)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

double SUM({(double)} input)
Parameter:

input

Returns:

Sum of all values in the bag; nulls are ignored

Since version:

0.2

double SUM({(bytearray)} input)
Parameter:

input

Returns:

Sum of all bytearrays, cast to doubles, in input; nulls are ignored

Since version:

0.1
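
The null handling described for the aggregate UDFs can be sketched in Python. This is only a model: a bag is represented as a list of one-field tuples, null as None, and the helper names are hypothetical. Returning None when a bag has no non-null values is an assumption made for the sketch.

```python
def non_null(bag):
    """Return the non-null first-field values of a bag of tuples."""
    return [t[0] for t in bag if t[0] is not None]


def pig_count(bag):
    """Model of COUNT: nulls are excluded, like SQL COUNT(col)."""
    return len(non_null(bag))


def pig_count_star(bag):
    """Model of COUNT_STAR: every record counts, like SQL COUNT(*)."""
    return len(bag)


def pig_sum(bag):
    """Model of SUM: nulls are ignored."""
    values = non_null(bag)
    # Assumption: no non-null values yields null (None).
    return sum(values) if values else None


def pig_avg(bag):
    """Model of AVG: nulls are left out of both the sum and the count."""
    values = non_null(bag)
    # Assumption: no non-null values yields null (None).
    return sum(values) / len(values) if values else None


bag = [(4,), (None,), (10,)]

counts = (pig_count(bag), pig_count_star(bag))
total = pig_sum(bag)
average = pig_avg(bag)
```

The null record changes the result of COUNT_STAR only. AVG divides by two, not three, because the null is ignored rather than treated as zero.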

Built-in chararray and bytearray UDFs

chararray CONCAT(chararray c1, chararray c2)
Parameters:

c1

c2

Returns:

Concatenation of c1 and c2

Since version:

0.1

bytearray CONCAT(bytearray b1, bytearray b2)
Parameters:

b1

b2

Returns:

Concatenation of b1 and b2

Since version:

0.1

int INDEXOF(chararray source, chararray search)
Parameters:

source: the chararray to search in

search: the chararray to search for

Returns:

Index of the first instance of search in source; -1 if search is not in source

Since version:

0.8

int LAST_INDEX_OF(chararray source, chararray search)
Parameters:

source: the chararray to search in

search: the chararray to search for

Returns:

Index of the last instance of search in source; -1 if search is not in source

Since version:

0.8
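
As a quick illustration of INDEXOF and LAST_INDEX_OF, the following Python sketch uses str.find and str.rfind, which share the same convention of returning -1 when the search string is absent. The wrapper names are hypothetical, and zero-based indexes are an assumption consistent with the other chararray UDFs in this appendix.

```python
def pig_indexof(source, search):
    """Model of INDEXOF: index of the first instance, or -1."""
    return source.find(search)


def pig_last_index_of(source, search):
    """Model of LAST_INDEX_OF: index of the last instance, or -1."""
    return source.rfind(search)


first = pig_indexof("banana", "an")
last = pig_last_index_of("banana", "an")
missing = pig_indexof("banana", "x")
```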

chararray LCFIRST(chararray input)
Parameter:

input

Returns:

input, with the first character converted to lowercase

Since version:

0.8

chararray LOWER(chararray input)
Parameter:

input

Returns:

input with all characters converted to lowercase

Since version:

0.8

chararray REGEX_EXTRACT(chararray source, chararray regex, int n)
Parameters:

source: the chararray to search in

regex: the regular expression to search for

n: take the nth match, counting from 0

Returns:

nth subset of the source matching regex; null if there are no matches

Since version:

0.8

(chararray) REGEX_EXTRACT_ALL(chararray source, chararray regex)
Parameters:

source: the chararray to search in

regex: the regular expression to search for

Returns:

Tuple containing all subsets of source matching regex; null if there are no matches

Since version:

0.8

chararray REPLACE(chararray source, chararray toReplace, chararray newValue)
Parameters:

source: the chararray to search in

toReplace: the chararray to be replaced

newValue: the new chararray to replace it with

Returns:

source with all instances of toReplace changed to newValue

Since version:

0.8

long SIZE(chararray input)
Parameter:

input

Returns:

Number of characters in input

Since version:

0.2

long SIZE(bytearray input)
Parameter:

input

Returns:

Number of bytes in input

Since version:

0.2

(chararray) STRSPLIT(chararray source)

Split a chararray by whitespace

Parameter:

source: the chararray to split

Returns:

Tuple with one field for each section of source

Since version:

0.8

(chararray) STRSPLIT(chararray source, chararray regex)

Split a chararray by a regular expression

Parameters:

source: the chararray to split

regex: the regular expression to use as the delimiter

Returns:

Tuple with one field for each section of source

Since version:

0.8

(chararray) STRSPLIT(chararray source, chararray regex, int maxsplits)

Split a chararray by a regular expression

Parameters:

source: the chararray to split

regex: the regular expression to use as the delimiter

maxsplits: the maximum number of splits

Returns:

Tuple with one field for each section of source; if there are more than maxsplits sections, only the first maxsplits sections will be in the tuple

Since version:

0.8

chararray SUBSTRING(chararray source, int start, int end)
Parameters:

source: the chararray to take the substring from

start: the start position (inclusive), counting from 0

end: the end position (exclusive), counting from 0

Returns:

Substring of source from start up to, but not including, end; error if source is shorter than start

Since version:

0.8
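
The inclusive start and exclusive end of SUBSTRING match Python slicing, so a short Python sketch can show which characters are selected. The function name is hypothetical, and raising ValueError stands in for the error mentioned above. Python slicing silently accepts an end position past the end of the string; whether Pig does the same is not stated in this appendix.

```python
def pig_substring(source, start, end):
    """Model of SUBSTRING: start is inclusive, end is exclusive, both from 0."""
    if len(source) < start:
        # Stand-in for the error raised when source is shorter than start.
        raise ValueError("source is shorter than start")
    return source[start:end]


prefix = pig_substring("programming", 0, 3)
middle = pig_substring("programming", 3, 7)
```

Here the character at position 3 is the first one in the second result, and the character at position 7 is left out.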

{(chararray)} TOKENIZE(chararray input)
Parameter:

input: the chararray to split

Returns:

input split on whitespace, with each resulting value being placed in its own tuple and all tuples placed in the bag

Since version:

0.1
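
TOKENIZE and the one-argument form of STRSPLIT both split on whitespace but return different shapes. The Python sketch below models that difference, following the descriptions in this appendix: a bag is a list of tuples and a Pig tuple is a Python tuple. The function names are hypothetical.

```python
def pig_tokenize(text):
    """Model of TOKENIZE: a bag with one single-field tuple per token."""
    return [(token,) for token in text.split()]


def pig_strsplit(text):
    """Model of STRSPLIT(source): one tuple with one field per section."""
    return tuple(text.split())


as_bag = pig_tokenize("making pig fly")
as_tuple = pig_strsplit("making pig fly")
```

The bag form suits operations that work on a collection of records, whereas the tuple form keeps all the pieces in a single record.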

chararray TRIM(chararray input)
Parameter:

input

Returns:

input with all leading and trailing whitespace removed

Since version:

0.8

chararray UCFIRST(chararray input)
Parameter:

input

Returns:

input with the first character converted to uppercase

Since version:

0.8

chararray UPPER(chararray input)
Parameter:

input

Returns:

input with all characters converted to uppercase

Since version:

0.8

Built-in complex type UDFs

{(chararray, chararray, double)} COR({(double)} b1, {(double)} b2)

Calculate the correlation between two bags of doubles

Parameters:

b1

b2

Returns:

First chararray is the name of b1, second chararray is the name of b2, double is the correlation between b1 and b2

Since version:

0.8

{(chararray, chararray, double)} COV({(double)} b1, {(double)} b2)

Calculate the covariance of two bags of doubles

Parameters:

b1

b2

Returns:

First chararray is the name of b1, second chararray is the name of b2, double is the covariance of b1 and b2

Since version:

0.8

bag DIFF(bag b1, bag b2)
Parameters:

b1

b2

Returns:

All records from b1 that are not in b2, and all records from b2 that are not in b1

Since version:

0.1
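
DIFF returns the records found in only one of the two bags. A minimal Python sketch of that idea follows, modeling bags as lists of tuples. The function name is hypothetical. Bags are unordered, so the order of the result carries no meaning, and the sketch makes no claim about how Pig treats duplicate records.

```python
def pig_diff(b1, b2):
    """Model of DIFF: records in b1 but not b2, plus records in b2 but not b1."""
    only_in_b1 = [t for t in b1 if t not in b2]
    only_in_b2 = [t for t in b2 if t not in b1]
    return only_in_b1 + only_in_b2


left = [(1,), (2,), (3,)]
right = [(2,), (3,), (4,)]

difference = pig_diff(left, right)
same = pig_diff(left, left)
```

Two bags with the same records produce an empty result, which makes DIFF useful for checking that two bags match.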

long SIZE(map input)
Parameter:

input

Returns:

Number of key-value pairs in input

Since version:

0.2

long SIZE(tuple input)
Parameter:

input

Returns:

Number of fields in input

Since version:

0.2

long SIZE(bag input)
Parameter:

input

Returns:

Number of tuples in input

Since version:

0.2

bag TOBAG(...)
Parameter:

Variable

Returns:

If all inputs have the same schema, the resulting bag will have that schema, else it will have a null schema; if the parameters are tuples, all schemas must have the same field names in addition to types

Since version:

0.8

map TOMAP(...)
Parameter:

Variable

Returns:

Input parameters are paired up and placed in a map as key/value, key/value; all keys must be chararrays; an odd number of arguments will result in an error

Since version:

0.9
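
The pairing rule for TOMAP can be sketched in Python as follows. The function name is hypothetical, a Pig map is modeled as a dict, and Python exceptions stand in for the errors Pig reports.

```python
def pig_tomap(*args):
    """Model of TOMAP: arguments are consumed as key, value, key, value."""
    if len(args) % 2 != 0:
        # An odd number of arguments leaves a key without a value.
        raise ValueError("TOMAP needs an even number of arguments")
    result = {}
    for key, value in zip(args[0::2], args[1::2]):
        if not isinstance(key, str):
            # All keys must be chararrays (strings in this model).
            raise TypeError("map keys must be chararrays")
        result[key] = value
    return result


record = pig_tomap("name", "alice", "age", 27)
```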

bag TOP(int numRecords, int field, bag source)
Parameters:

numRecords: the number of records to return

field: the field to sort on

source: the bag to return records from

Returns:

A bag with the top numRecords records from source, ranked by field

Since version:

0.8
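
As an illustration of TOP, the Python sketch below sorts a bag on one field and keeps the first numRecords records. The function name is hypothetical. Two points are assumptions rather than facts stated in this appendix: field is treated as a zero-based position, and "top" is taken to mean the largest values.

```python
def pig_top(num_records, field, source):
    """Model of TOP: keep num_records records, ranked by one field.

    Assumptions: field is a zero-based position in each tuple, and
    records are ranked from largest to smallest value of that field.
    """
    ranked = sorted(source, key=lambda record: record[field], reverse=True)
    return ranked[:num_records]


scores = [("a", 3), ("b", 9), ("c", 5)]
top_two = pig_top(2, 1, scores)
```

If the bag holds fewer than numRecords records, this model returns all of them.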

tuple TOTUPLE(...)
Parameter:

Variable

Returns:

A tuple with all of the fields passed in as arguments

Since version:

0.8

Built-in filter functions

boolean IsEmpty(bag input)
Parameter:

input

Returns:

true if input is empty; false otherwise

Since version:

0.1

boolean IsEmpty(map input)
Parameter:

input

Returns:

true if input is empty; false otherwise

Since version:

0.1

Miscellaneous built-in UDF

double RANDOM()
Returns:

A random double between 0 and 1

Since version:

0.4
