Impact Duration Versus Incident Duration

On May 31, 2008, at 5:00 p.m. local time, high-voltage lines shorted in the main electrical room of a Houston datacenter owned by hosting provider The Planet. The resulting explosion was large enough to knock down three walls. Because of fire safety concerns, the backup generators were taken offline as well. Power was restored to portions of the datacenter after a few days, but for thousands of servers, failover in this case meant physically transporting the boxes to another datacenter.

When disaster strikes, your first concern is getting your user traffic away from the problem as quickly as possible. You need to mitigate the impact, now. Don't worry too much about fixing the original problem; once you've stopped the impact, you have plenty of time to remediate the incident. Some rare accidents, such as the explosion described in the preceding paragraph, may take many weeks to repair. But as datacenters get larger, even more common incidents, such as a brief power loss, can take days to recover from, because it takes a long time to bring up a datacenter containing a hundred thousand servers. Focus your architecture on minimizing impact duration (how long users are affected) rather than incident duration (how long the underlying problem lasts, which is often out of your hands anyway).
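To make the distinction concrete, here is a minimal sketch in Python that computes both durations from three timestamps. The `Outage` class, its field names, and the timestamps are all hypothetical and chosen only for illustration; they are not taken from the incident described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical record of one outage; all fields are illustrative."""
    started: datetime    # when the fault occurred
    mitigated: datetime  # when user traffic was being served from elsewhere
    repaired: datetime   # when the original site was back in service

    @property
    def impact_duration(self) -> timedelta:
        # How long users were affected: fault until traffic was moved away.
        return self.mitigated - self.started

    @property
    def incident_duration(self) -> timedelta:
        # How long the underlying problem lasted: fault until full repair.
        return self.repaired - self.started


# Made-up example: traffic moved after 10 minutes, repair took 3 days.
outage = Outage(
    started=datetime(2024, 1, 1, 17, 0),
    mitigated=datetime(2024, 1, 1, 17, 10),
    repaired=datetime(2024, 1, 4, 17, 0),
)

print(outage.impact_duration)    # 0:10:00
print(outage.incident_duration)  # 3 days, 0:00:00
```

In this made-up case the repair takes three days, but users are affected for only ten minutes. The first number is the one your architecture can control.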

So, how do you get your user traffic away from the problem site? The usual solution is to use a Global Server Load Balancing (GSLB) platform. This is essentially a dynamic authoritative DNS server that can hand out ...
