Programming Pig

By Alan Gates. Published by O'Reilly Media, Inc.

Running Pig

You can run Pig locally on your machine or on your grid. You can also run Pig as part of Amazon’s Elastic MapReduce service.

Running Pig Locally on Your Machine

In Pig parlance, running Pig locally on your machine is called local mode. Local mode is useful for prototyping and debugging your Pig Latin scripts. Some people also use it to process small data with the same scripts they apply to large data. This keeps their data pipeline consistent across data of different sizes without wasting cluster resources on small files and small jobs.

In versions 0.6 and earlier, Pig executed scripts in local mode itself. Starting with version 0.7, it uses the Hadoop class LocalJobRunner, which reads from the local filesystem and executes MapReduce jobs locally. This has the nice property that Pig jobs run locally in the same way as they will on your cluster, and they all run in one process, making debugging much easier. The downside is that it is slow. Setting up a local instance of Hadoop has approximately a 20-second overhead, so even tiny jobs take at least that long.[2]

Let’s run a Pig Latin script in local mode. See Code Examples in This Book for how to download the data and Pig Latin for this example. The simple script in Example 2-1 loads the file NYSE_dividends, groups the file’s rows by stock ticker symbol, and then calculates the average dividend for each symbol.

Example 2-1. Running Pig in local mode

--average_dividend.pig
-- load data from NYSE_dividends, declaring the schema to have 4 fields
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
-- group rows together by stock ticker symbol
grouped   = group dividends by symbol;
-- calculate the average dividend per symbol
avg       = foreach grouped generate group, AVG(dividends.dividend);
-- store the results to average_dividend
store avg into 'average_dividend';

If you use head -5 to look at the NYSE_dividends file, you will see:

NYSE    CPO 2009-12-30  0.14
NYSE    CPO 2009-09-28  0.14
NYSE    CPO 2009-06-26  0.14
NYSE    CPO 2009-03-27  0.14
NYSE    CPO 2009-01-06  0.14

This matches the schema we declared in our Pig Latin script. The first field is the exchange this stock is traded on, the second field is the stock ticker symbol, the third is the date the dividend was paid, and the fourth is the amount of the dividend.
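As a rough cross-check on what the script computes, the same per-symbol average can be sketched outside Pig with awk. This is only an illustration, not part of Pig: the function name avg_dividend_by_symbol is hypothetical, and the sketch assumes the fields in NYSE_dividends are separated by tabs, which is the delimiter Pig's default load function expects. The number formatting may differ from what Pig's AVG prints.

```shell
# Hypothetical cross-check, not part of Pig: average the 4th field (dividend)
# for each distinct value of the 2nd field (symbol) in a tab-separated file.
avg_dividend_by_symbol() {
    awk -F '\t' '
        { sum[$2] += $4; n[$2]++ }      # accumulate total and row count per symbol
        END {
            for (s in sum)              # iteration order is unspecified, so sort below
                printf "%s\t%s\n", s, sum[s] / n[s]
        }
    ' "$1" | sort
}

# Usage (from the directory containing the data file):
#   avg_dividend_by_symbol NYSE_dividends | head -5
```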

Note

Remember that to run Pig you will need to set the JAVA_HOME environment variable to the directory that contains your Java distribution.

Switch to the directory where NYSE_dividends is located. You can then run this example on your local machine by entering:

pig_path/bin/pig -x local average_dividend.pig

where pig_path is the path to the Pig installation on your local machine.

The result should be a lot of output on your screen. Much of this is MapReduce’s LocalJobRunner generating logs, but some of it is Pig telling you how it will execute the script, giving you the status as it executes, and so on. Near the bottom of the output you should see the simple message Success!, which means all went well.

The script stores its output to average_dividend, so you might expect to find a file by that name in your local directory. Instead you will find a directory named average_dividend that contains a file named part-r-00000. Because Hadoop is a distributed system and usually processes data in parallel, when it outputs data to a file it creates a directory with the file’s name, and each writer creates a separate part file in that directory. In this case we had one writer, so we have one part file. We can look in that part file for the results by entering:

cat average_dividend/part-r-00000 | head -5

which returns:

CA      0.04
CB      0.35
CE      0.04
CF      0.1
CI      0.04
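When a job has more than one writer, the output directory holds several part files. A minimal sketch for reading them all in local mode follows. The helper name cat_parts is hypothetical, and it assumes every part file's name begins with part-, as part-r-00000 does above.

```shell
# Hypothetical helper: print every part file in a local output directory.
# The shell expands the glob in sorted order, so part-r-00000 comes first.
cat_parts() {
    for part in "$1"/part-*; do
        if [ -f "$part" ]; then     # skips the unexpanded pattern when nothing matches
            cat "$part"
        fi
    done
}

# Usage:
#   cat_parts average_dividend | head -5
```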

Running Pig on Your Hadoop Cluster

Most of the time you will be running Pig on your Hadoop cluster. As was covered in Downloading and Installing Pig, Pig runs locally on your machine or your gateway machine. All of the parsing, checking, and planning is done locally. Pig then executes MapReduce jobs in your cluster.

Note

When I say your gateway machine, I mean the machine from which you are launching Pig jobs. Usually this will be one or more machines that have access to your Hadoop cluster. However, depending on your configuration, it could be your local machine as well.

The only thing Pig needs to know to run on your cluster is the location of your cluster’s NameNode and JobTracker. The NameNode is the manager of HDFS, and the JobTracker coordinates MapReduce jobs. In Hadoop 0.18 and earlier, these locations are found in your hadoop-site.xml file. In Hadoop 0.20 and later, they are in three separate files: core-site.xml, hdfs-site.xml, and mapred-site.xml.

If you are already running Hadoop jobs from your gateway machine via MapReduce or another tool, you most likely have these files present. If not, the best course is to copy these files from nodes in your cluster to a location on your gateway machine. This guarantees that you get the proper addresses plus any site-specific settings.

If, for whatever reason, it is not possible to copy the appropriate files from your cluster, you can create a hadoop-site.xml file yourself. It will look like the following:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode_hostname:port</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtrack_hostname:port</value>
  </property>
</configuration>

You will need to find the names and ports for your NameNode and JobTracker from your cluster administrator.

Once you have located, copied, or created these files, you will need to tell Pig the directory they are in by setting the PIG_CLASSPATH environment variable to that directory. Note that this must point to the directory that the XML file is in, not the file itself. Pig will read all XML and properties files in that directory.
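Because PIG_CLASSPATH must name the directory rather than a file, a small check before exporting it can catch the common mistake. The sketch below is an assumption about how you might do this, not something Pig provides. The function name has_hadoop_conf is hypothetical, and requiring all three files in the Hadoop 0.20 layout is a choice made here for strictness.

```shell
# Hypothetical helper: succeed only if the argument is a directory holding
# either hadoop-site.xml (Hadoop 0.18 and earlier) or the three files used
# by Hadoop 0.20 and later.
has_hadoop_conf() {
    conf_dir=$1
    if [ -f "$conf_dir/hadoop-site.xml" ]; then
        return 0
    fi
    [ -f "$conf_dir/core-site.xml" ] &&
        [ -f "$conf_dir/hdfs-site.xml" ] &&
        [ -f "$conf_dir/mapred-site.xml" ]
}

# Usage, where hadoop_conf_dir is your configuration directory:
#   has_hadoop_conf hadoop_conf_dir && export PIG_CLASSPATH=hadoop_conf_dir
```

Passing the path of an XML file instead of its directory makes the check fail, which is the case the paragraph above warns about.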

Let’s run the same script on your cluster that we ran in the local mode example (Example 2-1). If you are running on a Hadoop cluster you have never used before, you will most likely need to create a home directory. Pig can do this for you:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -mkdir /user/username

where hadoop_conf_dir is the directory where your hadoop-site.xml or core-site.xml, hdfs-site.xml, and mapred-site.xml files are located; pig_path is the path to Pig on your gateway machine; and username is your username on the gateway machine. If you are using Pig 0.5 or earlier, change fs -mkdir to mkdir.

Note

Remember, you need to set JAVA_HOME before executing any Pig commands. See Downloading the Pig Package from Apache for details.

In order to run this example on your cluster, you first need to copy the data to your cluster:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -copyFromLocal NYSE_dividends NYSE_dividends

If you are running Pig 0.5 or earlier, change fs -copyFromLocal to copyFromLocal.

Now you are ready to run the Pig Latin script itself:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig average_dividend.pig

The first few lines of output will tell you how Pig is connecting to your cluster. After that it will describe its progress in executing your script. It is important for you to verify that Pig is connecting to the appropriate filesystem and JobTracker by checking that these values match the values for your cluster. If the filesystem is listed as file:/// or the JobTracker says localhost, Pig did not connect to your cluster. You will need to check that you entered the values properly in your configuration files and properly set PIG_CLASSPATH to the directory that contains those files.

Near the end of the output there should be a line saying Success!, which means that your execution succeeded. You can see the results by entering:

PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e cat average_dividend

which should give you the same connection information and then dump all of the stock ticker symbols and their average dividends.

In Example 2-1 you may have noticed that I made a point to say that average_dividend is a directory, and thus you have to cat the part file contained in that directory. However, in this example I ran cat directly on average_dividend. If you list average_dividend, you will see that it is still a directory in this example, but in Pig, cat can operate on directories. See Chapter 3 for a discussion of this.

Running Pig in the Cloud

Cloud computing,[3] along with the software as a service (SaaS) model, has taken off in recent years. This has been fortuitous for hardware-intensive applications such as Hadoop. Setting up and maintaining a Hadoop cluster is an expensive proposition in terms of hardware acquisition, facility costs, and maintenance and administration. Many users find that it is cheaper to rent the hardware they need instead.

Whether you or your organization decides to use Hadoop and Pig in the cloud or on owned and operated machines, the instructions for running Pig on your cluster are the same; see Running Pig on Your Hadoop Cluster.

However, Amazon’s Elastic MapReduce (EMR) cloud offering is different. Rather than allowing customers to rent machines for any type of process (like Amazon’s Elastic Compute Cloud [EC2] service and other cloud services), EMR allows users to rent virtual Hadoop clusters. These clusters read data from and write data to Amazon’s Simple Storage Service (S3). This means users do not even need to set up their own Hadoop cluster, which they would have to do if they used EC2 or a similar service.

EMR users can access their rented Hadoop cluster via their browser, SSH, or a web services API. For information about EMR, visit http://aws.amazon.com/elasticmapreduce. However, I suggest beginning with this nice tutorial, which will introduce you to the service.

Command-Line and Configuration Options

Pig has a number of command-line options that you can use with it. You can see the full list by entering pig -h. Most of these options will be discussed later, in the sections that cover the features these options control. In this section I discuss the remaining miscellaneous options:

-e or -execute

Execute a single command in Pig. For example, pig -e fs -ls will list your home directory.

-h or -help

List the available command-line options.

-h properties

List the properties that Pig will use if they are set by the user.

-P or -propertyFile

Specify a property file that Pig should read.

-version

Print the version of Pig.

Pig also uses a number of Java properties. The entire list can be printed out with pig -h properties. Specific properties are discussed later in sections that cover the features they control.

Hadoop also has a number of Java properties it uses to determine its behavior. For example, you can pass options to the JVM that runs your map and reduce tasks by setting mapred.child.java.opts. In Pig version 0.8 and later, these can be passed to Pig, and then Pig will pass them on to Hadoop when it invokes Hadoop. In earlier versions, the properties had to be in hadoop-site.xml so that the Hadoop client itself would pick them up.

Properties can be passed to Pig on the command line using -D in the same format as any Java property, for example bin/pig -D exectype=local. When placed on the command line, these property definitions must come before any Pig-specific command-line options (such as -x local). They can also be specified in the conf/pig.properties file that is part of your Pig distribution. Finally, you can specify a separate properties file by using -P. If properties are specified both on the command line and in a properties file, the command-line specification takes precedence.

Return Codes

Pig uses return codes, described in Table 2-1, to communicate success or failure.

Table 2-1. Pig return codes

Value  Meaning                                      Comments
0      Success
1      Retriable failure
2      Failure
3      Partial failure                              Used with multiquery; see Nonlinear Data Flows
4      Illegal arguments passed to Pig
5      IOException thrown                           Would usually be thrown by a UDF
6      PigException thrown                          Usually means a Python UDF raised an exception
7      ParseException thrown (can happen after parsing if variable substitution is being done)
8      Throwable thrown (an unexpected exception)
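
When Pig is launched from a wrapper script, the return code is available in the shell as $?. The following sketch translates that code into the meanings listed in Table 2-1. The function name describe_pig_rc is hypothetical; only the code-to-meaning mapping comes from the table.

```shell
# Hypothetical helper: translate a Pig exit status into its meaning
# from Table 2-1.
describe_pig_rc() {
    case "$1" in
        0) echo "Success" ;;
        1) echo "Retriable failure" ;;
        2) echo "Failure" ;;
        3) echo "Partial failure" ;;
        4) echo "Illegal arguments passed to Pig" ;;
        5) echo "IOException thrown" ;;
        6) echo "PigException thrown" ;;
        7) echo "ParseException thrown" ;;
        8) echo "Throwable thrown" ;;
        *) echo "Unknown return code: $1" ;;
    esac
}

# Usage, immediately after a Pig invocation:
#   pig_path/bin/pig average_dividend.pig
#   describe_pig_rc $?
```

A wrapper might, for instance, rerun the script only when the code is 1, since that is the one value the table marks as retriable.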


[2] Another reason for switching to MapReduce for local mode was that as Pig added features that took advantage of more advanced MapReduce features, it became difficult or impossible to replicate those features in local mode. Thus local mode and MapReduce mode were diverging in their feature set.

[3] Being the current flavor of the month, the term cloud computing is being used to describe just about anything that takes more than one computer and is not located on a person’s desktop. In this chapter I use cloud computing to mean the ability to rent a cluster of computers and place software of your choosing on those computers.
