Chapter 10. Summary: Doing Distributed Data Science

Throughout this book, we’ve looked at specific pieces of the Hadoop ecosystem. Part I discussed how to interact with and use a cluster. As we’ve discussed, Hadoop acts as an operating system for distributed computing: just as the operating system on a local computer provides a file system and process management, Hadoop provides distributed data storage and access through HDFS, and resource management and job scheduling through YARN. Together, HDFS and YARN provide a mechanism for distributed analysis of extremely large datasets.

The original method to program distributed jobs was to use the MapReduce framework, which allowed you to specify mapper and reducer tasks that could be chained together for larger computations. Because Python is one of the most popular tools for data science, we looked specifically at how you might use Hadoop Streaming to execute MapReduce jobs with Python scripts. We also explored a more native solution: the use of Spark’s Python API to execute Spark jobs in a Hadoop cluster using YARN. Finally, we wrapped up our discussion of lower-level tools with a look at distributed analyses and design patterns that are routinely employed on a cluster.
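As a reminder of what the mapper and reducer pattern looks like in Python, here is a minimal sketch in the style of a Hadoop Streaming job. It assumes a word count task, which is an illustrative choice and not necessarily the example used earlier in the book, and it assumes the default Streaming convention of tab-separated key/value lines. The function names `mapper`, `reducer`, and `run_local` are hypothetical; `run_local` only imitates the shuffle-and-sort phase by sorting mapper output in memory.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a tab-separated (word, 1) record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(records):
    """Sum the counts for each key; assumes records arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Imitate shuffle-and-sort locally by sorting the mapper output."""
    return list(reducer(sorted(mapper(lines))))
```

On a cluster, the mapper and reducer would typically live in separate scripts that read from `sys.stdin` and print to standard output, with Hadoop performing the sort between the two phases. Calling `run_local` on a small list of strings is a convenient way to check the logic before submitting a job.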

Part II shifted away completely from the lower-level programming details to the higher-level tools for data mining, data ingestion, data flows, and machine learning. This section oriented itself toward the more day-to-day aspects of performing distributed ...
