Input and Output

Before you can do anything of interest, you need to be able to add inputs and outputs to your data flows.

Load

The first step in any data flow is to specify your input. In Pig Latin this is done with the load statement. By default, load looks for your data on HDFS in a tab-delimited file, using the default load function PigStorage. For example, divs = load '/data/examples/NYSE_dividends'; will look for a file called NYSE_dividends in the directory /data/examples. You can also specify relative path names. By default, your Pig jobs run in your home directory on HDFS, /user/yourlogin, and unless you change directories, all relative paths are evaluated from there. You can also specify a full URL for the path, for example, hdfs://nn.acme.com/data/examples/NYSE_dividends, to read the file from the HDFS instance that has nn.acme.com as its NameNode.
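As a sketch of these two options, the first statement below uses a relative path (it assumes NYSE_dividends sits in your HDFS home directory), and the second uses a full URL naming the NameNode from the example above:

-- relative path, resolved against your home directory on HDFS
divs = load 'NYSE_dividends';

-- full URL, naming the NameNode explicitly
divs = load 'hdfs://nn.acme.com/data/examples/NYSE_dividends';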

In practice, most of your data will not be in tab-separated text files. You also might be loading data from storage systems other than HDFS. Pig allows you to specify the function for loading your data with the using clause. For example, if you wanted to load your data from HBase, you would use the loader for HBase:

divs = load 'NYSE_dividends' using HBaseStorage();

If you do not specify a load function, the built-in function PigStorage will be used. You can also pass arguments to your load function via the using clause. For example, if you are reading comma-separated text data, PigStorage takes an argument to indicate which character to use as a separator:

divs = load 'NYSE_dividends' using PigStorage(',');

The load statement can also have an as clause, which allows you to specify the schema of the data you are loading. (The syntax and semantics of declaring schemas in Pig Latin are discussed in Schemas.)

divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);

When specifying a file to read from HDFS, you can specify a directory instead. In this case, Pig will find all files under the directory you specify and use them as input for that load statement. So, if you have a directory input with two data files, today and yesterday, under it, and you specify input as your input to load, Pig will read both today and yesterday. If the directory you specify contains other directories, files in those directories will be included as well.
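As a sketch (the relation name daily is just illustrative), assuming a directory input containing the files today and yesterday:

-- reads both input/today and input/yesterday, plus any files in subdirectories
daily = load 'input';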

PigStorage and TextLoader, the two built-in Pig load functions that operate on HDFS files, support globs.[6] With globs, you can read multiple files that are not under the same directory or read some but not all files under a directory. Table 5-1 describes globs that are valid in Hadoop 0.20. Be aware that glob meaning is determined by HDFS underneath Pig, so the globs that will work for you depend on your version of HDFS. Also, if you are issuing Pig Latin commands from a Unix shell command line, you will need to escape many of the glob characters to prevent your shell from expanding them.

Table 5-1. Globs in Hadoop 0.20

Glob      Meaning
?         Matches any single character.
*         Matches zero or more characters.
[abc]     Matches a single character from the character set (a, b, c).
[a-z]     Matches a single character from the character range (a..z), inclusive. The first character must be lexicographically less than or equal to the second character.
[^abc]    Matches a single character that is not in the character set (a, b, c). The ^ character must occur immediately to the right of the opening bracket.
[^a-z]    Matches a single character that is not from the character range (a..z), inclusive. The ^ character must occur immediately to the right of the opening bracket.
\c        Removes (escapes) any special meaning of character c.
{ab,cd}   Matches a string from the string set {ab, cd}.
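For example, the following sketch uses globs to pick out files by name; the filenames here are hypothetical and only illustrate the syntax:

-- read only the files whose names end in -2009 or -2010
divs = load 'NYSE_dividends-{2009,2010}';

-- read every file in the current directory whose name starts with NYSE_
nyse = load 'NYSE_*';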

Store

After you have finished processing your data, you will want to write it out somewhere. Pig provides the store statement for this purpose. In many ways it is the mirror image of the load statement. By default, Pig stores your data on HDFS in a tab-delimited file using PigStorage:[7]

store processed into '/data/examples/processed';

Pig will write the results of your processing into a directory processed in the directory /data/examples. You can specify relative path names, as well as a full URL for the path, such as hdfs://nn.acme.com/data/examples/processed.

If you do not specify a store function, PigStorage will be used. You can specify a different store function with a using clause:

store processed into 'processed' using HBaseStorage();

You can also pass arguments to your store function. For example, if you want to store your data as comma-separated text data, PigStorage takes an argument to indicate which character to use as a separator:

store processed into 'processed' using PigStorage(',');

As noted in Running Pig, when writing to a filesystem, processed will be a directory containing part files rather than a single file. How many part files are created depends on the parallelism of the last job before the store. If that job has a reduce phase, the number of part files is determined by the parallel level set for the job; see Parallel for how this is determined. If it is a map-only job, the number of part files is determined by the number of maps, which is controlled by Hadoop, not Pig.
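As an illustration (the relations here are hypothetical, and the parallel clause is covered in Parallel), the following would produce ten part files, because the last job before the store has a reduce phase whose parallelism is set to 10:

grpd = group daily by exchange parallel 10;  -- 10 reducers for this job
cnt  = foreach grpd generate group, COUNT(daily);
store cnt into 'by_exchange';                -- 10 part files under by_exchange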

Dump

In most cases you will want to store your data somewhere when you are done processing it. But occasionally you will want to see it on the screen. This is particularly useful during debugging and prototyping sessions. It can also be useful for quick ad hoc jobs. dump directs the output of your script to your screen:

dump processed;

Up through version 0.7, the output of dump matches the format of constants in Pig Latin. So longs are followed by an L, floats by an F, and maps are surrounded by [] (brackets), tuples by () (parentheses), and bags by {} (braces). Starting with version 0.8, the L for longs and the F for floats have been dropped, though the markers for the complex types have been kept. Nulls are indicated by missing values, and fields are separated by commas. Because each record in the output is a tuple, it is surrounded by ().
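As a purely illustrative sketch (the values are made up), a record whose fields are two chararrays, a float, and a bag of tuples would print like this in version 0.8 and later; in 0.7 and earlier, the floats would carry a trailing F (for example, 0.14F):

(NYSE,CPO,0.14,{(0.04),(0.1)})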



[6] Any loader that uses FileInputFormat as its InputFormat will support globs. Most loaders that load data from HDFS use this InputFormat.

[7] A single function can be both a load and store function, as PigStorage is.
