Starting the Exercise

The first thing on the schedule for Sunday was to validate the staging environment and make sure that it was as close to production as possible. We had maintained a staging environment that was functionally equivalent to production, but generally at a smaller baseline (a single availability zone rather than three, and sometimes smaller instance sizes), which we used for final integration testing, infrastructure shakeout, and load testing.

In this case, we decided to make it a viable stand-in for production by beefing it up to production-like scale and simulating a bit of load with some small scripts. All of our applications had been built to scale horizontally from the beginning and had been load tested rather extensively in earlier exercises, so we launched a couple of bash scripts that generated enough traffic to simulate light usage, not full-scale load. Having some load was important because much of our logging and alerting was based on actual failures that would only surface with actual use. While it would be ideal to run a test like this in production, we were testing rather drastic failures that could have dramatic user-facing effects if anything failed unexpectedly; and since failure had never been tested before, any and all failures would fail unexpectedly. Given that, and that we were talking about the website of the President of the United States of America, we decided to go with the extremely safe approximation of production.
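
A minimal sketch of that kind of load script might look like the following; the staging hostname and endpoint paths here are purely illustrative assumptions, not the actual scripts we ran:

    #!/bin/bash
    # Hypothetical light-load generator: steady, low-volume traffic so that
    # logging and alerting paths see real use. Host and paths are placeholders.
    STAGING_HOST="https://staging.example.com"
    PATHS="/ /donate /about"

    while true; do
      for path in $PATHS; do
        curl -s -o /dev/null -w "%{http_code} ${path}\n" "${STAGING_HOST}${path}"
      done
      sleep 1   # light usage, not full-scale load
    done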

While the engineers were validating the staging environment, Nick Hatch on our devops team and I worked to set up our own backchannel for what was about to happen. As the orchestrators of the failures, we needed a place to document the changes we would be making to inflict those failures on the engineers.

In addition to the backchannel, we (the devops team and I) decided that since we were attempting to keep this as close to a real incident as possible, and since we were all nerds, we should essentially live action role play (LARP) the entire exercise. The devops team would be causing the destruction and, at the same time, helping the engineers through it. It was vital to the success of the exercise that the devops team keep those two roles separate: they could not let what we were actually doing leak through, and instead had to work through normal detection and diagnosis with the engineers as if they knew nothing.

One thing to expect in a game day is that it will not go according to plan, even for the people planning it. The organizers of the event and the individuals participating in it should be ready and willing to go with the flow and make adjustments on the fly. In fact, just as the engineers were sending the final tags to staging and the plan was set with Nick Hatch on how to begin the game (the first “issue” would be the loss of all the database replicas, a supposed no-op), reports began to trickle in about legitimate downstream issues affecting the payment processor.

The OFA Incident Response Campfire chat room was suddenly a cacophony of engineers wondering whether this was really happening or part of the test. In an effort to keep game day as close to a real incident as possible, we had intended to use our actual Incident Response channel for real-time communication. With real incidents impinging on game day, it became clear that this was not going to work, so the decision was made on the fly to split the simulated incident response into its own channel; the tech finance team would have to fight both the production issue and any game-day issues at the same time. After all, when it rains it pours.

With adjustments to the plan made, Hatch changed the security group on the database replicas, and as far as all of the code was concerned, the replicas were down. From that action onward, no communication was made with the engineers about what was happening. The organizers of the game day sat back and watched the incident response channel to see whether the engineers would notice that they were now operating with only a single master, and how long it would take them. It took about four minutes.
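
For illustration only, cutting the replicas off at the security-group level can be as simple as revoking the ingress rule that lets the application tier reach the database port; the group ID, port, and CIDR below are hypothetical, and the actual change may well have been made with different tooling:

    # Hypothetical sketch using the AWS CLI: revoke the rule that allows the
    # app tier to reach the replica databases. All values are placeholders.
    aws ec2 revoke-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp \
      --port 3306 \
      --cidr 10.0.0.0/16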

Four minutes to recognize a failure event while the engineers were anticipating failures is way too long. Four minutes while engineers are paying attention is 15 minutes or more when no one is looking. If it’s a failure condition, you should alert on it. If it’s a condition of note, you should measure it. If you’re measuring it with statsd and Graphite, you can alert on it using Seyren. The teams were using all of these tools, but they weren’t properly alerting on many common failure cases.
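
As a sketch of what “measure it” can look like in that stack, emitting a metric to statsd is a one-liner over UDP; the metric names and statsd host below are assumptions, and the alert itself would then be a Seyren check against the corresponding Graphite series:

    # Hypothetical examples of pushing metrics to statsd over UDP with netcat.
    # statsd aggregates these and forwards them to Graphite; Seyren can then
    # alert when a series crosses a threshold.
    echo "db.replica.connection_errors:1|c"  | nc -u -w1 statsd.internal 8125
    echo "db.replica.read_latency_ms:42|ms"  | nc -u -w1 statsd.internal 8125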

At this point it was clear that there were legitimate process and alerting issues inherent in our infrastructure. Engineers asked the organizers to pause the event so logs could be analyzed to ensure that all of the right spots had been hit and the new code was acting as expected. Well, this was a simulation of real life, and real life doesn’t stop and wait. Real life says that if you lose your replicas and move all reads to your master, your master is probably going to die. That’s precisely what happened next.
