Part I. Introduction to Distributed Computing

The first part of Data Analytics with Hadoop introduces distributed computing for big data using Hadoop. Chapter 1 motivates the need for distributed computing when building data products and discusses the primary workflow for, and the opportunity in, using Hadoop for data science. Chapter 2 then dives into the technical requirements for distributed storage and computation and explains how Hadoop acts as an operating system for big data. Chapters 3 and 4 introduce distributed programming using the MapReduce and Spark frameworks, respectively. Finally, Chapter 5 explores typical computations and patterns in both MapReduce and Spark from the perspective of a data scientist doing analytics on large datasets.
