No Failover Testing

I once worked with a team designing a large e-commerce website infrastructure. When I say large, I mean eight Cisco 6509 switches serving more than 200 physical servers (most with multiple Solaris zones), providing upwards of a gigabit per second of content. Timelines were tight, and everyone was stressed. As the schedule was compressed over the life of the project, one of the key phases eliminated was failure testing.

After the site went live, a device failed. The site was designed to withstand any single point of failure, yet it stopped functioning properly. It turned out the failover device had been misconfigured in a way that only presented a problem when the active device failed. Because the failure caused a loss of connectivity to the site, we had no way of getting to the failed equipment, except to drive to the colocation facility. This failure, which should not have been possible, resulted in a two-hour outage while someone drove to the facility with a console cable.

Had failover testing been done, the problem would have been found during testing and the outage avoided. The design was correct, but the implementation of the design was not. Always insist on failure testing in high-availability environments. Failure testing should also be repeated regularly, as part of normal maintenance at scheduled intervals.
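
One way to make such a test measurable is to probe the site continuously while the active device is deliberately taken out of service, then check how long the site was unreachable. The following is only a minimal sketch under stated assumptions, not a procedure from the project described above: the function names (`probe`, `monitor`, `longest_outage`) are hypothetical, and it assumes the service answers TCP connections on a known address and port.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor(host, port, duration, interval=1.0):
    """Probe host:port every `interval` seconds for roughly `duration` seconds.

    Returns a list of (timestamp, reachable) pairs in time order.
    """
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(host, port)))
        time.sleep(interval)
    return samples


def longest_outage(samples):
    """Return the longest observed outage, in seconds.

    An outage runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
    # If the service never came back, count the outage up to the last sample.
    if down_since is not None and samples:
        longest = max(longest, samples[-1][0] - down_since)
    return longest
```

In a test window, one might start `monitor` against the site's public address, power off or disconnect the active device, and afterward compare `longest_outage(samples)` with whatever failover time the design promises. An outage that never ends during the run is exactly the kind of misconfigured standby this story describes.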
