We will now consider some of the more complex and most critical parts of Pig: data input and output. Operating on huge datasets is inherently I/O-intensive. Hadoop’s massive parallelism and its movement of processing to the data mitigate this cost but do not remove it. Having efficient methods to load and store data is therefore critical. Pig provides default load and store functions for text data and for HBase, but many users find they need to write their own load and store functions to handle the data formats and storage mechanisms they use.
As with evaluation functions, the design goal for
load and store functions in Pig was to make easy things easy
and hard things possible. Another aim was to make load and store functions a
thin wrapper around Hadoop’s InputFormat and
OutputFormat. The intention is that once you have an
input format and output format for your data, the additional work of
creating and storing Pig tuples is minimal. As with evaluation
functions, more complex features such as schema management and
projection pushdown are handled via separate interfaces to avoid
cluttering the base interface.
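To make this concrete, here is a minimal sketch of a load function, assuming Pig's LoadFunc API (Pig 0.7 and later). It delegates record reading to Hadoop's TextInputFormat and only converts each line into a single-field tuple; the class name SimpleTextLoader is ours, not a builtin:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SimpleTextLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // location is whatever string followed 'load' in the Pig Latin script.
        // Here we pass it straight through as an input path.
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // The actual splitting and record reading is done by Hadoop.
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        // Convert one Hadoop record into one Pig tuple;
        // returning null signals end of input.
        try {
            if (!reader.nextKeyValue()) return null;
            Text line = (Text) reader.getCurrentValue();
            Tuple t = tupleFactory.newTuple(1);
            t.set(0, line.toString());
            return t;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```

It could then be invoked as A = load 'foo' using SimpleTextLoader();. Note how thin the wrapper is: everything except the record-to-tuple conversion in getNext is delegated to the input format.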
One other important design goal for load and store
functions was to not assume that the input sources and output sinks are
HDFS. In the examples throughout this book,
A = load 'foo'; has foo referring to a file, but there is no
need for that to be the case. All that is required is that
foo be a resource locator that makes sense to your load function. ...
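For instance, Pig's builtin HBaseStorage interprets the locator as an HBase table name rather than an HDFS path. In this sketch, the table name users and the columns info:name and info:age are made up for illustration:

users = load 'hbase://users'
        using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:name info:age');

Here nothing named users need exist in HDFS at all; the string is meaningful only to the load function that receives it.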