Before you can run Pig on your machine or your Hadoop cluster, you will need to download and install it. If someone else has taken care of this, you can skip ahead to Running Pig.
The package available from Apache is the official version of Apache Pig. It comes packaged with all of the JAR files needed to run Pig, and you can download it from Pig’s release page.
Pig does not need to be installed on your Hadoop cluster. It runs on the machine from which you launch Hadoop jobs. Though you can run Pig from your laptop or desktop, in practice, most cluster owners set up one or more machines that have access to their Hadoop cluster but are not part of the cluster (that is, they are not data nodes or task nodes). This makes it easier for administrators to update Pig and associated tools, as well as to secure access to the clusters. These machines are called gateway machines or edge machines. In this book I use the term gateway machine.
You will need to install Pig on these gateway machines. If your Hadoop cluster is accessible from your desktop or laptop, you can install Pig there as well. Also, you can install Pig on your local machine if you plan to use Pig in local mode.
The core of Pig is written in Java and is thus portable across operating systems. The shell script that starts Pig is a bash script, so it requires a Unix environment. Hadoop, which Pig depends on, even in local mode, also requires a Unix environment for its filesystem operations. In practice, most Hadoop clusters run a flavor of Linux. Many Pig developers develop and test Pig on Mac OS X.
Pig requires Java 1.6, and Pig versions 0.5 through 0.9 require Hadoop 0.20. For future versions, check the download page for information on what version(s) of Hadoop they require. The correct version of Hadoop is included with the Pig download. If you plan to use Pig in local mode or install it on a gateway machine where Hadoop is not currently installed, there is no need to download Hadoop separately.
Once you have downloaded Pig, you can place it anywhere you like on your machine, as it does not depend on being in a certain location. To install it, place the tarball in the directory of your choosing and unpack it with tar.
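As a minimal sketch of the unpacking step, the helper below extracts a gzipped tarball into a target directory. The function name unpack_pig and the filename pig-x.y.z.tar.gz are hypothetical placeholders; the real filename depends on the release you downloaded.

```shell
# Hypothetical helper (not part of Pig): unpack a release tarball.
# Usage: unpack_pig <tarball> <target-dir>
unpack_pig() {
  if [ ! -f "$1" ]; then
    echo "tarball not found: $1" >&2
    return 1
  fi
  # Create the target directory if it does not exist yet.
  mkdir -p "$2"
  # x = extract, z = decompress gzip, f = read the named file;
  # -C switches to the target directory before extracting.
  tar xzf "$1" -C "$2"
}

# Example call, assuming a tarball with this placeholder name:
#   unpack_pig pig-x.y.z.tar.gz "$HOME/tools"
```

Running tar xzf directly in the directory that holds the tarball has the same effect; the wrapper only adds a check that the file exists.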
The only other setup needed before running Pig is making sure that the environment variable JAVA_HOME is set to the directory that contains your Java distribution. Pig will fail immediately if this value is not in the environment. You can set this in your shell, specify it on the command line when you invoke Pig, or set it explicitly in your copy of the Pig script pig, located in the bin directory that you just unpacked. You can find the appropriate value for JAVA_HOME by executing which java and stripping the bin/java from the end of the result.
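The stripping step can be sketched with shell parameter expansion. The function name java_home_from is a hypothetical helper, not something shipped with Pig, and the paths in the comments are only examples.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the full
# path of the java executable by removing the trailing /bin/java.
java_home_from() {
  # ${1%/bin/java} drops the suffix /bin/java if it is present.
  printf '%s\n' "${1%/bin/java}"
}

# Typical use, assuming java is on your PATH:
#   export JAVA_HOME="$(java_home_from "$(which java)")"
# For example, /usr/lib/jvm/java-6/bin/java yields /usr/lib/jvm/java-6.
```

If the java found on your PATH is a symbolic link (for instance /usr/bin/java), the result points at the link's directory rather than the Java distribution, so you may need to resolve the link first.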
In addition to the official Apache version, there are companies that repackage and distribute Hadoop and associated tools. Currently the most popular of these is Cloudera, which produces RPMs for Red Hat–based systems and packages for use with APT on Debian systems. It also provides tarballs for other systems that cannot use one of these package managers.
The upside of using a distribution like Cloudera’s is that all of the tools are packaged and tested together. Also, if you need professional support, it is available. The downside is that you are constrained to move at the speed of your distribution provider. There is a delay between an Apache release of Pig and its availability in various distributions.
For complete instructions on downloading and installing Hadoop and Pig from Cloudera, see Cloudera’s download site. Note that you have to download Pig separately; it is not part of the Hadoop package.
In addition to the official release available from Pig’s Apache site, it is possible to download Pig from Apache’s Maven repository. This site includes JAR files for Pig, for the source code, and for the Javadocs, as well as the POM file that defines Pig’s dependencies. Development tools that are Maven-aware can use this to pull down Pig’s source and Javadoc. If you use ant in your build process, you can also pull the Pig JAR from this repository automatically.
When you download Pig from Apache, you also get the Pig source code. This enables you to debug your version of Pig or just peruse the code to see how it works. But if you want to live on the edge and try out a feature or a bug fix before it is available in a release, you can download the source from Apache’s Subversion repository. You can also apply patches that have been uploaded to Pig’s issue-tracking system but that are not yet checked into the code repository. Information on checking out Pig using svn or cloning the repository via git is available on Pig’s version control page.