Chapter 16. Hardware Failures and Recovery

Martin Lanner

All clusters occasionally experience a hardware failure. Hardware failures can come in many forms: whole drives can become unresponsive, a power supply to a node can fail, a switch can break and affect a single node or an entire rack, or a power outage can take out racks or an entire region.

Fortunately, Swift was designed to withstand hardware failures, both small and large. Drives, nodes, or even whole racks can fail, and a Swift cluster can continue to operate with no impact on the durability or availability of the data.

By default, Swift places data in cluster locations that are as unique as possible, preferring locations in different regions, zones, nodes, and disks. This makes small clusters easier to deploy and keeps data durable when the cluster experiences a hardware failure. All data stored in Swift also has several “handoff” locations defined: alternative placement locations in the cluster that are used when one of the three replica locations is unavailable, for example because of a hardware failure.
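The placement idea can be illustrated with a small sketch. This is a toy model written for this discussion, not Swift's ring implementation: the `Device` tuple, the greedy ordering, and the names `placement_order`, `locate`, and `write_targets` are all assumptions made for the example. It picks each next location so that it shares as few regions, zones, and nodes as possible with the locations already chosen, treats the first three picks as the replica locations, and treats the rest as handoffs.

```python
import hashlib
from collections import namedtuple

# Hypothetical device description; Swift's real ring stores more fields.
Device = namedtuple("Device", "region zone node disk")


def _tiebreak(name, dev):
    """Deterministic per-object ordering among equally good candidates."""
    raw = f"{name}/{dev.region}/{dev.zone}/{dev.node}/{dev.disk}".encode()
    return hashlib.sha256(raw).hexdigest()


def _overlap(dev, chosen):
    """How many already-chosen devices share a region, zone, or node with dev."""
    same_region = sum(c.region == dev.region for c in chosen)
    same_zone = sum(c[:2] == dev[:2] for c in chosen)
    same_node = sum(c[:3] == dev[:3] for c in chosen)
    return (same_region, same_zone, same_node)


def placement_order(name, devices):
    """Greedily order devices so each pick is as far from earlier picks as possible."""
    remaining = list(devices)
    chosen = []
    while remaining:
        best = min(remaining,
                   key=lambda d: (_overlap(d, chosen), _tiebreak(name, d)))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def locate(name, devices, replicas=3):
    """Return (primary locations, handoff locations) for an object name."""
    order = placement_order(name, devices)
    return order[:replicas], order[replicas:]


def write_targets(primaries, handoffs, failed):
    """Substitute the next healthy handoff for every failed primary."""
    spare = iter([h for h in handoffs if h not in failed])
    targets = []
    for p in primaries:
        targets.append(next(spare) if p in failed else p)
    return targets


# Toy cluster: one region, three zones, two nodes per zone, two disks per node.
devices = [Device(1, z, n, d) for z in (1, 2, 3) for n in (1, 2) for d in (1, 2)]
primaries, handoffs = locate("account/container/object", devices)
targets = write_targets(primaries, handoffs, failed={primaries[0]})
```

In this toy cluster the three primaries land in three different zones, and when the first primary is marked as failed, `write_targets` swaps in a handoff while leaving the other two locations unchanged.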

It is important to note that drives in a Swift cluster are not mirrored and are not configured with RAID. When hardware fails, such as a drive, the entire cluster participates in replicating the affected data to handoff locations. No RAID rebuild is needed, so the cluster avoids the performance penalty that a rebuild can impose.
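To see why recovery work spreads across the cluster rather than landing on a single rebuild target, consider the following toy simulation. It is only a sketch under stated assumptions: the drive names, the partition count, and the hash-based ordering in `drive_order` are invented for the example and stand in for Swift's ring, which works differently. For every partition that had a replica on the failed drive, it records which surviving drive would act as the first handoff.

```python
import hashlib
from collections import Counter

# Hypothetical cluster size and partition count, chosen only for illustration.
DRIVES = [f"d{i}" for i in range(20)]
PARTITIONS = 1024
REPLICAS = 3


def drive_order(partition):
    """Deterministic pseudo-random ordering of all drives for one partition."""
    def key(drive):
        return hashlib.sha256(f"{partition}:{drive}".encode()).hexdigest()
    return sorted(DRIVES, key=key)


def recovery_targets(failed):
    """Count how many of the failed drive's partitions each surviving drive takes over."""
    counts = Counter()
    for part in range(PARTITIONS):
        order = drive_order(part)
        primaries, handoffs = order[:REPLICAS], order[REPLICAS:]
        if failed in primaries:
            # The first handoff is never the failed drive, because the
            # failed drive is already among this partition's primaries.
            counts[handoffs[0]] += 1
    return counts


targets = recovery_targets("d0")
affected = sum(targets.values())  # partitions that lost a replica
```

Inspecting `targets` shows the affected partitions divided among many surviving drives, so each drive takes on only a small share of the extra work. A RAID rebuild, by contrast, funnels all of the reconstruction onto one replacement drive.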

Handling a Failed Drive

A drive failure in Swift is not an emergency. ...
