O'Reilly logo

Learning Hadoop 2 by Garry Turkington, Gabriele Modena

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Fundamentals of Apache Pig

The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.

Pig Latin programs are generally organized as follows:

  • A LOAD statement reads data from HDFS
  • A series of statements aggregates and manipulates data
  • A STORE statement writes output to the filesystem
  • Alternatively, a DUMP statement displays the output to the terminal

The following example shows a sequence of statements that outputs the top 10 hashtags ordered by the frequency, extracted from the dataset of tweets:

tweets = LOAD 'tweets.json' USING JsonLoader('created_at:chararray, id:long, id_str:chararray, text:chararray'); hashtags = FOREACH tweets { GENERATE FLATTEN( REGEX_EXTRACT( text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required