An introduction to MapReduce

MapReduce is a programming model that allows you to express algorithms for efficient execution on a distributed system. The MapReduce model was first introduced by Google in 2004 (https://research.google.com/archive/mapreduce.html), as a way to automatically partition datasets over different machines and for automatic local processing and the communication between cluster nodes.

The MapReduce framework was used in cooperation with a distributed filesystem, the Google File System (GFS or GoogleFS), which was designed to partition and replicate data across the computing cluster. Partitioning was useful for storing and processing datasets that wouldn't fit on a single node while replication ensured that the system ...

Get Python High Performance - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.