Hadoop

Before we jump to loading data, a quick overview of MapReduce is warranted. Although Druid comes prepackaged with a convenient MapReduce job to accommodate historical data, generally speaking, large distributed systems will need custom jobs to perform analyses over the entire data set.

An overview of MapReduce

MapReduce is a framework that breaks processing into two phases: a map phase and a reduce phase. In the map phase, a function is applied to the entire set of input data, one element at a time. Each application of the map function results in a set of tuples, each containing a key and a value. Tuples with similar keys are then combined via the reduce function. The reduce function emits another set of tuples, typically by combining the ...

Get Storm Blueprints: Patterns for Distributed Real-time Computation now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.