Deploying the analytics

With Hadoop in place, we can now focus on the distributed processing frameworks that we will use for analysis.

Performing a batch analysis with the Pig infrastructure

The first of the distributed processing frameworks that we will examine is Pig. Pig is a framework for data analysis that allows the user to articulate an analysis in Pig Latin, a simple high-level scripting language. Pig then compiles these scripts down to MapReduce jobs.
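To give a sense of what such a script looks like, the following is a minimal sketch of a Pig Latin job that counts click-throughs per URL. The field names and tab delimiter are assumptions for illustration, not the actual schema of our data:

-- hypothetical schema: tab-delimited user and URL columns
clicks = LOAD 'click_thru_data.txt' USING PigStorage('\t')
         AS (user:chararray, url:chararray);
by_url = GROUP clicks BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(clicks) AS hits;
STORE counts INTO 'click_counts';

When a script like this runs, Pig translates the LOAD, GROUP, and FOREACH statements into one or more MapReduce jobs and submits them to the Hadoop cluster.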

Although Pig can read data from a few different systems (for example, S3), we will use HDFS as our data storage mechanism in this example. Thus, the first step in our analysis is to copy the data into HDFS.

To do this, we issue the following Hadoop commands:

hadoop fs -mkdir /user/bone/temp
hadoop fs -copyFromLocal click_thru_data.txt ...
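Once the copy completes, the data can be referenced from a Pig script by its HDFS path. Assuming the destination of the copy above is the /user/bone/temp directory we just created (the path and schema here are assumptions), the LOAD statement would look something like this:

-- path assumes the file was copied into the directory created above
clicks = LOAD '/user/bone/temp/click_thru_data.txt' USING PigStorage('\t')
         AS (user:chararray, url:chararray);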
