Fundamentals of Apache Pig

The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.

Pig Latin programs are generally organized as follows:

  • A LOAD statement reads data from HDFS
  • A series of statements aggregates and manipulates data
  • A STORE statement writes output to the filesystem
  • Alternatively, a DUMP statement displays the output to the terminal

The following example shows a sequence of statements that outputs the top 10 hashtags ordered by the frequency, extracted from the dataset of tweets:

tweets = LOAD 'tweets.json' USING JsonLoader('created_at:chararray, id:long, id_str:chararray, text:chararray'); hashtags = FOREACH tweets { GENERATE FLATTEN( REGEX_EXTRACT( text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', ...

Get Learning Hadoop 2 now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.