In this chapter we look at several comprehensive Pig examples and real-world Pig use cases.
In “Schema Tuple Optimization” we introduced a more compact tuple implementation called the schema tuple. However, if your input data is sparse, a schema tuple is not the most efficient way to represent it. You only need to store the position and value of the nonempty fields of the tuple, which is what a sparse tuple does. When the vast majority of fields in a tuple are empty, this data structure can save a lot of space. Pig does not support sparse tuples natively, but it does allow users to define custom tuple implementations, so you can implement one yourself. In this section, we will show you how to implement the sparse tuple and use it in Pig.
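Before turning to Pig's own classes, it may help to see the storage idea in isolation. The following is a minimal sketch in plain Java that does not use any Pig API. The class name SparseRow and its methods are hypothetical and chosen only for illustration. It assumes that an empty field is represented by null and that only nonempty fields are kept in a sorted map keyed by position.

```java
import java.util.TreeMap;

// Hypothetical, Pig-independent illustration of sparse field storage.
class SparseRow {
    // Maps field position -> value, for nonempty fields only (sorted by position).
    private final TreeMap<Integer, Object> fields = new TreeMap<>();
    // Logical number of fields, counting the empty ones as well.
    private final int size;

    SparseRow(int size) {
        this.size = size;
    }

    // Logical width of the row.
    int size() {
        return size;
    }

    // Number of entries actually held in memory.
    int storedFields() {
        return fields.size();
    }

    // Returns the value at the given position, or null if the field is empty.
    Object get(int index) {
        checkIndex(index);
        return fields.get(index);
    }

    // Stores a value; setting null clears the field so it takes no space.
    void set(int index, Object value) {
        checkIndex(index);
        if (value == null) {
            fields.remove(index);
        } else {
            fields.put(index, value);
        }
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("Index " + index + ", size " + size);
        }
    }
}
```

With this sketch, a row declared with 1,000 fields of which only two are set holds just two map entries, while size() still reports 1,000. A sorted map is one reasonable choice here because it allows the nonempty fields to be visited in position order.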
First, we need to write a class that implements the Tuple interface. However, implementing every method of the Tuple interface is tedious. To make it easier, we extend AbstractTuple, which already implements most of the common methods. Inside SparseTuple, we create a TreeMap that stores the index and value of each nonempty field. We also keep track of the size of the tuple. Together, these two fields hold the complete state of the sparse tuple. Here is the data structure along with the getter and setter methods of