Foreword

Jeremy Edberg, Information Cowboy, December 2012

In mid-2008, I was handling operations for reddit.com, an online community for sharing and discussing links, serving a few tens of millions of page views per month. At the time, we were hosting the whole site on 21 1U HP servers (in addition to four of the original servers for the site) in two racks in a San Francisco data center. Around that time, Steve, one of the founders of reddit, came to me and suggested I check out this AWS thing that his buddies at Justin.tv had been using with some success; he thought it might be good for us, too. I set up a VPN; we copied over a set of our data, and started using it for batch processing.

In early 2009, we had a problem: we needed more servers for live traffic, so we had to make a choice—build out another rack of servers, or move to AWS. We chose the latter, partly because we didn’t know what our growth was going to look like, and partly because it gave us enormous flexibility for resiliency and redundancy by offering multiple availability zones, as well as multiple regions if we ever got to that point. Also, I was tired of running to the data center every time a disk failed, a fan died, a CPU melted, etc.

When designing any architecture, one of the first assumptions one should make is that any part of the system can break at any time. AWS is no exception. Instead of fearing this failure, one must embrace it. At reddit, one of the things we got right with AWS from the start was making sure that we had copies of our data in at least two zones. This proved handy during the great EBS outage of 2011. While we were down for a while, it was for a lot less time than most sites, in large part because we were able to spin up our databases in the other zone, where we kept a second copy of all of our data. If not for that, we would have been down for over a day, like all the other sites in the same situation.

During that EBS outage, I, like many others, watched Netflix, also hosted on AWS. It is said that if you’re on AWS and your site is down but Netflix is up, it’s probably your own fault. It was that reputation, among other things, that drew me to move from reddit to Netflix, which I did in July 2011. Now that I’m responsible for Netflix’s uptime, it is my job to help the company maintain that reputation.

Netflix requires a superior level of reliability. With tens of thousands of instances and more than 30 million paying customers, reliability is absolutely critical. So how do we do it? We expect the inevitable failure, plan for it, and even cause it sometimes. At Netflix, we follow our monkey theory—we simulate things that go wrong and find things that are different. And thus was born the Simian Army, our collection of agents that constructively muck with our AWS environment to make us more resilient to failure.

The most famous of these is the Chaos Monkey, which kills random instances in our production account—the same account that serves actual, live customers. Why wait for Amazon to fail when you can induce the failure yourself, right? We also have the Latency Monkey, which injects latency into the connections between services to simulate network issues. We have a whole host of other monkeys too (most of them available on GitHub).

The point of the Monkeys is to make sure we are ready for any failure mode. Sometimes it works and we avoid outages; sometimes new failures come up that we haven’t planned for. In those cases, our resiliency systems are truly tested, making sure they are generic and broad enough to handle the situation.

One failure that we weren’t prepared for was in June 2012. A severe storm hit Amazon’s complex in Virginia, and they lost power to one of their data centers (a.k.a. Availability Zones). Due to a bug in the mid-tier load balancer that we wrote, we did not route traffic away from the affected zone, which caused a cascading failure. This failure, however, was our fault, and we learned an important lesson. This incident also highlighted the need for the Chaos Gorilla, which we successfully ran just a month later, intentionally taking out an entire zone’s worth of servers to see what would happen (everything went smoothly). We ran another test of the Chaos Gorilla a few months later and learned even more about what we are doing right and where we could do better.

A few months later, there was another zone outage, this time due to the Elastic Block Store. Although we generally avoid EBS, many of our instances still use EBS root volumes. As such, we had to abandon an availability zone. Luckily for us, our previous run of Chaos Gorilla gave us not only the confidence to make the call to abandon a zone, but also the tools to make it quick and relatively painless.

Looking back, there are plenty of other things we could have done to make reddit more resilient to failure, many of which I learned through ad hoc trial and error, as well as from working at Netflix. Unfortunately, I didn’t have a book like this one to guide me. This book outlines in excellent detail exactly how to build resilient systems in the cloud. From the crash course in systems to the detailed instructions on specific technologies, it covers many of the very same things we stumbled upon as we flailed wildly, discovering solutions to our problems. If I had had this book when I was first starting on AWS, I would have saved myself a lot of time and headache, and I hope you will benefit from its lessons after reading it.

This book also teaches a very important lesson: expect and embrace failure. If you do, you will be much better off.
