Chapter 12. High Availability

A key part of the enterprise architecture of critical services is planning for failure at all levels of the stack. The assumption of this chapter is that users and operators both want services to be as available as possible. But how do we achieve that? What happens if a server hosting some critical service goes down? What if an entire rack with several cluster machines loses power? What about a power distribution unit serving several racks? What if there are transient problems that degrade node performance? Having a plan to handle such scenarios—and regularly testing that plan—is of paramount importance.

The good news is that most of the components in a Hadoop cluster are built from the ground up with failure in mind and have built-in mechanisms for dealing with failure of individual components. In fact, the central design principle behind Hadoop is to build a reliable system from individually unreliable components.
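To make that principle concrete, here is a minimal back-of-the-envelope sketch. It is not taken from Hadoop itself. The function name `combined_availability` and the 99% figure are hypothetical, chosen purely for illustration. The calculation assumes that replicas fail independently of one another. Real clusters only approximate this, because a rack or power failure can take out several machines at once.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is available.

    Assumes every copy has the same availability and that copies fail
    independently (an idealization; shared racks or power feeds
    introduce correlated failures).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is unavailable only if every replica is down at once.
    probability_all_down = (1.0 - component_availability) ** replicas
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # Hypothetical component that is up 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} replica(s): {combined_availability(0.99, n):.6f}")
```

Under these assumptions, each additional replica multiplies the chance of total unavailability by the failure probability of a single component. This is why redundancy across modest hardware can yield a dependable service, provided the replicas do not share a single point of failure.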

If architected correctly, a single Hadoop cluster will prove incredibly resilient to failure. In this chapter, we cover how core Hadoop services and other projects in the ecosystem can be set up for high availability (HA) within a single cluster. We focus only on the higher-level concepts in this chapter; some of the lower-level aspects related to physical infrastructure, such as dual-power supplies and redundant cabling, are covered in “Basic Datacenter Concepts”. In Chapter 13 we discuss some of the aspects of backup and replication ...
