Chapter 3. Pig’s Data Model

Before we look at the operators that Pig Latin provides, we need to understand Pig’s data model: its data types, how it handles concepts such as missing data, and how you can describe your data to Pig.

Types

Pig’s data types can be divided into two categories: scalar types, which contain a single value, and complex types, which contain other types.

Scalar Types

Pig’s scalar types are simple types that appear in most programming languages. With the exception of bytearrays, they are all represented in Pig interfaces by standard Java classes (java.lang classes for most types, and a java.math class for biginteger), making them easy to work with in UDFs:

Int

An integer. Ints are represented in interfaces by java.lang.Integer. They store four-byte signed integers. Constant integers are expressed as integer numbers: for example, 42.

Long

A long integer. Longs are represented in interfaces by java.lang.Long. They store eight-byte signed integers. Constant longs are expressed as integer numbers with an L appended: for example, 5000000000L.

Biginteger (since Pig 0.12)

An integer of effectively unbounded size (it is limited only by available memory). Bigintegers are represented in interfaces by java.math.BigInteger. There are no biginteger constants. Chararray and numeric types can be cast to biginteger to produce a constant value in the script. An important note: the performance of bigintegers is significantly worse than that of ints or longs. Whenever your value will fit into one of those types, you should use that type rather than biginteger. ...
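
To make the size limits of these three integer types concrete, here is a minimal Java sketch. It is only an illustration: the class ScalarTypeSketch and its method smallestPigType are hypothetical names invented for this example and are not part of Pig's API. The sketch relies solely on java.math.BigInteger to decide which of int, long, or biginteger is the smallest type that can hold a given value.

```java
import java.math.BigInteger;

// Hypothetical helper for illustration only; not part of Pig's API.
class ScalarTypeSketch {

    // Returns the name of the smallest Pig integer type that can hold the value.
    // BigInteger.bitLength() excludes the sign bit, so a value fits in a
    // four-byte signed int when it needs at most 31 bits, and in an
    // eight-byte signed long when it needs at most 63 bits.
    static String smallestPigType(BigInteger value) {
        if (value.bitLength() <= 31) {
            return "int";
        }
        if (value.bitLength() <= 63) {
            return "long";
        }
        return "biginteger";
    }

    public static void main(String[] args) {
        BigInteger small = new BigInteger("42");
        BigInteger medium = new BigInteger("5000000000");
        BigInteger huge = BigInteger.ONE.shiftLeft(64); // 2^64, too large for a long

        System.out.println(small + " -> " + smallestPigType(small));
        System.out.println(medium + " -> " + smallestPigType(medium));
        System.out.println(huge + " -> " + smallestPigType(huge));
    }
}
```

With the sample values in main, 42 is classified as an int, 5000000000 as a long, and 2^64 as a biginteger. This mirrors the advice above: reach for biginteger only when the value genuinely does not fit in the smaller types.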
