Chapter 13. Use Cases and Programming Examples

In this chapter we will take a look at several comprehensive Pig examples and real-world Pig use cases.

Sparse Tuples

In “Schema Tuple Optimization” we introduced a more compact tuple implementation called the schema tuple. However, if your input data is sparse, a schema tuple is not the most efficient way to represent it. When the vast majority of fields in a tuple are empty, you only need to store the position and value of each nonempty field, which is what a sparse tuple does, and this can save a lot of space. Pig does not support sparse tuples natively. However, it allows users to define custom tuple implementations, so you can implement one yourself. In this section, we will show you how to implement a sparse tuple and use it in Pig.

First, we need to write a SparseTuple class that implements the Tuple interface. Implementing every method of the Tuple interface is tedious, so to make it easier we derive SparseTuple from AbstractTuple, which already implements most of the common methods. Inside SparseTuple, we create a TreeMap that stores the index and value of each nonempty field. We also keep track of the size of the tuple. Together, these two fields hold the complete state of the sparse tuple. Here is the data structure along with the getter and setter methods of SparseTuple:

public class SparseTuple extends AbstractTuple {

    Map<Integer, Object> matrix = new TreeMap<Integer, Object ...
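The storage idea described here, a sorted map of nonempty fields plus a separately tracked size, can be tried out without Pig on the classpath. The following is only a minimal sketch under that assumption: SparseFieldStore and its methods (get, set, append, storedCount) are hypothetical names chosen for illustration, the class does not extend AbstractTuple, and it is not the SparseTuple implementation from this section.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical standalone sketch of sparse field storage.
// It mirrors the state described in the text (a TreeMap plus a size),
// but it is NOT Pig's Tuple API and does not extend AbstractTuple.
class SparseFieldStore {
    // Maps field position to value; only nonempty fields are stored.
    // TreeMap keeps the entries sorted by position.
    private final Map<Integer, Object> fields = new TreeMap<>();

    // Logical number of fields, counting the empty ones too.
    private int size;

    SparseFieldStore(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        this.size = size;
    }

    // Logical width of the tuple.
    int size() {
        return size;
    }

    // Number of fields that actually occupy space in the map.
    int storedCount() {
        return fields.size();
    }

    // Returns the value at fieldNum, or null if that field is empty.
    Object get(int fieldNum) {
        checkIndex(fieldNum);
        return fields.get(fieldNum);
    }

    // Stores a value; setting null clears the entry so empty fields cost nothing.
    void set(int fieldNum, Object val) {
        checkIndex(fieldNum);
        if (val == null) {
            fields.remove(fieldNum);
        } else {
            fields.put(fieldNum, val);
        }
    }

    // Adds one field at the end; a null value only grows the logical size.
    void append(Object val) {
        if (val != null) {
            fields.put(size, val);
        }
        size++;
    }

    private void checkIndex(int fieldNum) {
        if (fieldNum < 0 || fieldNum >= size) {
            throw new IndexOutOfBoundsException(
                "field " + fieldNum + " is outside a tuple of size " + size);
        }
    }

    public static void main(String[] args) {
        SparseFieldStore t = new SparseFieldStore(1000);
        t.set(3, "pig");
        t.set(997, 42);
        // Prints "1000 fields, 2 stored"
        System.out.println(t.size() + " fields, " + t.storedCount() + " stored");
    }
}
```

In this sketch a 1,000-field tuple with two nonempty fields holds only two map entries, which is the space saving the sparse representation is after. The trade-off is that each lookup goes through the map rather than a direct array index.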
