Part III. Taking Hadoop to the Cloud

In the previous chapters, we studied how to build Hadoop clusters that meet enterprise requirements; we now turn our attention to achieving the same in the cloud. Cloud technology enables the entire stack of information technology to be consumed as fully programmable and automated services. For example, storage, networking, and servers become infrastructure as a service (IaaS), and platform-level software such as database deployments or access management software becomes platform as a service (PaaS). The high degree of programmability and automation allows almost complete self-service for the customer to control and customize each layer, from IaaS to PaaS.

Before large-scale public cloud computing became part of the mainstream in IT, virtualization for Hadoop was mostly considered an antipattern. This was in large part due to Hadoop’s distributed nature and its extensive reliance on local disks on each server for efficient operation. Running Hadoop on clouds thus often boils down to one question: can I store all my data in the cloud and process it efficiently? The answer is yes.

Public cloud providers operate at such scale (often called hyperscale) that Hadoop environments and their high demand for I/O throughput can be accommodated at reasonable prices. In the meantime, Hadoop distributors have also acted on the significant opportunity of Hadoop in the cloud: they have invested heavily in easing deployment (increasing performance and efficiency) ...

