Fundamentals of Apache Pig
The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.
Pig Latin programs are generally organized as follows:
- A
LOAD
statement reads data from HDFS - A series of statements aggregates and manipulates data
- A
STORE
statement writes output to the filesystem - Alternatively, a
DUMP
statement displays the output to the terminal
The following example shows a sequence of statements that outputs the top 10 hashtags ordered by the frequency, extracted from the dataset of tweets:
tweets = LOAD 'tweets.json' USING JsonLoader('created_at:chararray, id:long, id_str:chararray, text:chararray'); hashtags = FOREACH tweets { GENERATE FLATTEN( REGEX_EXTRACT( text, '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', ...
Get Learning Hadoop 2 now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.