Chapter 3. Apache Hadoop

By Edd Dumbill

Apache Hadoop has been the driving force behind the growth of the big data industry. You’ll hear it mentioned often, along with associated technologies such as Hive and Pig. But what does it do, and why do you need all its strangely named friends, such as Oozie, ZooKeeper, and Flume?


Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. By large, we mean anything from 10-100 gigabytes on up. How is this different from what went before?

Existing enterprise data warehouses and relational databases excel at processing structured data and can store massive amounts of it, though at a cost: the requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited to agile exploration of massive heterogeneous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference.

This chapter examines the components of the Hadoop ecosystem and explains the function of each.

The Core of Hadoop: MapReduce

Created at Google in response to the problem of building web search indexes, the MapReduce framework is the powerhouse behind most of today’s big data processing. In addition to Hadoop, you’ll find MapReduce inside MPP and NoSQL databases ...
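
To make the map and reduce phases concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API. It is an illustrative example rather than anything specific to this chapter, and the class names (WordCount, TokenizerMapper, IntSumReducer) are our own. The mapper emits a (word, 1) pair for every word it sees; the framework groups pairs by key; the reducer sums the counts for each word.

    // Illustrative Hadoop word-count job; class names are our own choices.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each distinct word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The value of this model is that the programmer writes only the two small functions; Hadoop handles splitting the input across the cluster, shuffling intermediate pairs between the map and reduce phases, and recovering from failed nodes.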
