Gradual Failures

When replicating extremely large data stores between two quite distant locations, it may become infeasible to achieve real-time or even near-real-time replication delays. A write which hits one datacenter may take up to several hours to be copied to the other, along with all the other simultaneous writes carpooling along in the replication stream. This can make it quite difficult to achieve a tight recovery point objective (RPO). If one site were to suddenly fall into the ocean, you would lose several hours' worth of data! Fortunately, datacenters don't fall into oceans very often. Sudden outages generally occur only when a utility power failure combines with a UPS or generator failure (with the odd explosion from time to time). However, there are plenty of other ways for datacenters to fail, and a surprising number of them happen slowly. If you detect the problem and react quickly, this can give you precious time to sync up your replication stream and save the data.
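To make the arithmetic behind "precious time" concrete, here is a minimal sketch of the question you would be asking during a slow failure: can the replication backlog drain before the site goes down? The function names, parameters, and the simple constant-rate model are all assumptions made for illustration; they are not taken from any particular replication system.

```python
def drain_time_seconds(backlog_bytes, link_bytes_per_s, write_bytes_per_s):
    """Estimate how long the replication backlog takes to drain.

    Hypothetical constant-rate model: the link ships link_bytes_per_s while
    new writes keep arriving at write_bytes_per_s. Returns None if the
    backlog can never drain at these rates.
    """
    if backlog_bytes <= 0:
        return 0.0
    net_rate = link_bytes_per_s - write_bytes_per_s
    if net_rate <= 0:
        return None  # writes arrive at least as fast as they are shipped
    return backlog_bytes / net_rate


def can_save_data(backlog_bytes, link_bytes_per_s, write_bytes_per_s,
                  seconds_until_failure):
    """True if the backlog is expected to drain before the site fails."""
    needed = drain_time_seconds(backlog_bytes, link_bytes_per_s,
                                write_bytes_per_s)
    return needed is not None and needed <= seconds_until_failure
```

Under this model, anything that widens the gap between link throughput and incoming writes shortens the drain time, which is why early detection of a slow failure matters so much.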

A few years back, we had an HVAC failure in one of our datacenters (an HVAC is basically an air conditioner the size of a Mack truck). This caused the temperature to start rising in one part of the facility. It shouldn't have been that big a deal; the datacenter was designed to be able to lose an HVAC and still keep the ambient temperature at a reasonable level with the remaining units. Unfortunately, there was a fire sensor in the area that was getting hot. It sent a false positive, triggering the alarm. Now, the first thing you ...

Web Operations by John Allspaw, Jesse Robbins