The surge in interest in Big Data is partly due to the availability of new tools that enable massive amounts of data to be processed readily. One such tool is Apache Hadoop, an open source framework, based on the MapReduce programming model, for distributing parallel tasks over a cluster of machines. However, writing applications with the Hadoop framework still requires a complete Java application to be developed, even for trivial tasks. What is needed is a convenient tool that allows simple tasks to be expressed in a high-level form without resorting to manual coding.
Enter Apache Pig, which is exactly such a tool. Pig is a high-level platform for creating MapReduce programs that run on top of Hadoop. The language used for writing these programs is called Pig Latin, an SQL-like language for describing operations to be performed on large datasets. One major difference between SQL and Pig Latin is that SQL is declarative in nature, whereas Pig Latin is more procedural (it does not abstract away the implementation details). Look at Chapter 5: Introduction to Pig Latin in Programming Pig for more on Pig Latin. Let’s look at a simple example to see Pig Latin in action:
logs = LOAD '/var/log/messages';
warnings = FILTER logs BY $0 MATCHES '.*WARN+.*';
STORE warnings INTO 'warnings';
This script loads the ‘/var/log/messages’ file into a bag named logs. A filter is then applied to this bag, so that only rows containing the uppercase string “WARN” are retained, and the result is placed in another bag, warnings. Finally, this bag is written out under the path ‘warnings’.
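The same filter can also be written with an explicit schema instead of the positional reference $0. This variant is a sketch; the field name line is simply our own choice:

```
-- Load each line of the file as a single chararray field named line
logs = LOAD '/var/log/messages' AS (line:chararray);
-- Keep only lines containing the string WARN
warnings = FILTER logs BY line MATCHES '.*WARN.*';
STORE warnings INTO 'warnings';
```

Giving fields names makes longer scripts considerably easier to read than chains of $0, $1, and so on.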
When executed within a proper environment, this script will distribute the task of filtering logs over a Hadoop cluster (which may contain any number of machines).
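Assuming Pig is installed and the script above is saved as, say, filter_logs.pig (a file name of our own choosing), it can be launched from the command line in either local mode or MapReduce mode:

```
# Run locally against the machine's own file system, convenient for testing
pig -x local filter_logs.pig

# Run against the Hadoop cluster (the default execution mode)
pig -x mapreduce filter_logs.pig
```

Local mode is useful for trying a script on a small sample before submitting it to the cluster.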
A Pig Latin script essentially consists of a sequence of statements. Each statement takes one or more relations as input and produces a relation as output (the fields of a relation may be of types such as int, long, chararray or bytearray); these statements are known as relational operators. Some of the operators provided by Pig Latin include FILTER, FOREACH, GROUP, ORDER, SPLIT, JOIN, LOAD and STORE.
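As a small illustration of how these operators chain together (the input file logs.csv and its fields are hypothetical), a script might filter, group and count records:

```
-- Load comma-separated records with an explicit schema
logs = LOAD 'logs.csv' USING PigStorage(',') AS (level:chararray, msg:chararray);
-- Keep only the error records
errors = FILTER logs BY level == 'ERROR';
-- Group all records by level and count each group
by_level = GROUP logs BY level;
level_counts = FOREACH by_level GENERATE group AS level, COUNT(logs) AS n;
```

Each statement defines a new relation in terms of earlier ones, so the script reads as a step-by-step data pipeline.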
Let’s look at a more complex example to understand the facilities offered by Pig Latin:
words = LOAD 'words' AS (word:chararray);
word_groups = GROUP words BY word;
word_count = FOREACH word_groups GENERATE COUNT(words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO 'wordcount';
The script above uses a combination of operators to count the number of occurrences of each word in a list of words and save the counts. First the file ‘words’ is loaded as a bag of single-field tuples, each holding one word of type chararray. This bag is then grouped by word, and by applying a FOREACH operator to the grouped bag, the occurrences of each word are counted and stored in a new bag, word_count. This bag is then sorted in descending order of count, and the result is stored in a file named ‘wordcount’.
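While developing a script like this, two diagnostic operators are handy: DESCRIBE prints the schema of a relation, and DUMP prints a relation’s contents to the console instead of writing them to a file:

```
-- Show the schema Pig has inferred for word_count
DESCRIBE word_count;
-- Print the sorted counts to the console
DUMP ordered_word_count;
```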
This is just the tip of the iceberg when it comes to writing parallel tasks using Apache Pig. For more in-depth information, visit the official website at http://pig.apache.org/.
You can find out more on using Pig Latin in the books listed below.
Safari Books Online has the content you need
Check out these Apache Pig and Big Data books available from Safari Books Online:
Programming Pig is an ideal learning tool and reference for Apache Pig, the platform that helps you describe and run large data projects on Hadoop. With Pig, you can analyze data without having to create a full-fledged application, making it easy for you to experiment with new data sets. This book shows you how.
Hadoop MapReduce Cookbook is a one-stop guide to processing large and complex data sets using the Hadoop ecosystem. The book introduces you to simple examples and then dives deep to solve in-depth big data use cases.
NoSQL databases are an efficient and powerful tool for storing and manipulating vast quantities of data. Most NoSQL databases scale well as data grows. In addition, they are often malleable and flexible enough to accommodate semi-structured and sparse data sets. Professional NoSQL presents fundamental concepts and practical solutions for getting you ready to use NoSQL databases. Expert author Shashank Tiwari begins with a helpful introduction on the subject of NoSQL, explains its characteristics and typical uses, and looks at where it fits in the application stack. Unique insights help you choose which NoSQL solutions are best for solving your specific data storage needs.