Small Batches Reduce Overhead

In my experience, this is the most counterintuitive effect of working in small batches. Most organizations have tuned their batch size to reduce overhead. For example, if QA takes a week to certify a release, the company probably releases no more than once every 30 or 60 days. Telling such a company that it should work in a two-week batch size sounds absurd: it would spend 50% of its time waiting for QA to certify each release! But this argument is not quite right, and the reason is something so surprising that I didn't really believe it the first few times I saw it in action: organizations get better at the things they do very often. So, when we start checking in code more often, releasing more often, or conducting more frequent design reviews, we can actually do a lot to make those steps dramatically more efficient.

Of course, that doesn't necessarily mean we will make those steps more efficient. A common line of argument is: if we have the power to make a step more efficient, why don't we invest in that infrastructure first, and then reduce the batch size once the overhead is lower? This makes sense in theory, and yet it rarely works in practice. The bottlenecks that large batches cause are often hidden; it takes work to make them evident, and even more work to fix them. When the existing system is working "good enough," these projects inevitably languish.

These changes pay increasing dividends, because each improvement now directly frees up somebody in QA or operations while also reducing the total time required for the certification step. Those freed-up people might spend some of that time helping the development team prevent bugs in the first place, or take on some of the development team's routine work. That frees up even more development resources, and so on. Pretty soon, the team can be developing and testing in a continuous feedback loop, addressing micro-bottlenecks the moment they appear. If you've never had the chance to work in an environment like this, I highly recommend you try it. I doubt you'll go back.

Let me show you what this looked like for the operations and engineering teams at IMVU (http://www.imvu.com/). We had made so many improvements to our tools and processes for deployment that it was pretty hard to take the site down. We had five strong levels of defense:

  • Each engineer had his own sandbox that mimicked production as closely as possible (whenever it diverged, we'd inevitably find out in a "Five Whys" [http://startuplessonslearned.com/2008/11/five-whys.html] shortly thereafter).

  • We had a comprehensive set of unit, acceptance, functional, and performance tests, and we practiced test-driven development (TDD) across the whole team. Our engineers built a series of test tags, so you could quickly run just the subset of tests in your sandbox that seemed relevant to your current project or feature (a sketch of this kind of tagging follows the list).

  • One hundred percent of those tests ran, via a continuous integration cluster, after every check-in. When a test failed, that revision was blocked from being deployed (a minimal sketch of this gate also appears after the list).

  • When someone wanted to do a deployment, we had a completely automated system that we called the cluster immune system. It would deploy the change incrementally, one machine at a time, continually monitoring the health of those machines, as well as the cluster as a whole, to see whether the change was causing problems. If it didn't like what it saw, it would reject the change, do a fast revert, and lock deployments until someone investigated what went wrong (a simplified sketch follows the list).

  • We had a comprehensive set of Nagios alerts that would trigger a pager in operations if anything went wrong. Because Five Whys kept turning up a few key metrics that were hard to set static thresholds for, we even had a dynamic prediction algorithm that would make forecasts based on past data and fire an alert whenever a metric went outside its normal bounds (a minimal version of that idea is sketched after the list).
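
Here is what the test-tagging idea from the second bullet can look like in practice. This is a minimal sketch using pytest markers as a stand-in; the chapter doesn't describe IMVU's actual test framework, so the function, marker names, and tests below are hypothetical.

    # Hypothetical sketch: tag tests by feature area so an engineer can run just
    # the subset relevant to the work at hand in a sandbox, while CI still runs
    # everything. pytest markers stand in for IMVU's in-house test tags.
    import pytest

    def apply_discount(price_cents: int, percent: int) -> int:
        """Toy function under test."""
        return price_cents - (price_cents * percent) // 100

    @pytest.mark.payments
    def test_discount_rounds_down():
        assert apply_discount(999, 10) == 900

    @pytest.mark.payments
    @pytest.mark.slow
    def test_zero_discount_leaves_price_unchanged():
        assert apply_discount(999, 0) == 999

    @pytest.mark.avatar
    def test_avatar_area_placeholder():
        assert True  # stands in for a real avatar-rendering test

An engineer working on payments might run pytest -m "payments and not slow" in the sandbox for fast feedback; in a real project you would also register the markers in pytest.ini so that typos in tag names are caught.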
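
The continuous integration gate in the third bullet reduces to a very small loop: check out the revision, run everything, and mark the revision deployable only if the suite is green. The sketch below uses git and pytest as stand-ins; IMVU's actual tooling isn't specified in the chapter.

    # Hypothetical check-in gate: run 100% of the tests against a revision and
    # record it as deployable only if every test passes. git and pytest here are
    # assumptions, not IMVU's actual stack.
    import subprocess

    def certify_revision(revision: str, deployable: set) -> bool:
        """Return True (and mark the revision deployable) only if the full suite passes."""
        subprocess.run(["git", "checkout", revision], check=True)
        result = subprocess.run(["pytest", "-q"], check=False)
        if result.returncode == 0:
            deployable.add(revision)
        return result.returncode == 0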
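
The cluster immune system in the fourth bullet boils down to three behaviors: roll out one machine at a time, watch health continuously, and on any regression revert fast and lock the pipeline. Here is a simplified sketch of that control flow; the callback names and the shape of the health check are assumptions, not IMVU's actual implementation.

    # Simplified sketch of a "cluster immune system": push a revision one machine
    # at a time, check health after each step, and on trouble revert everything
    # touched and lock further deployments until a human investigates.
    from typing import Callable, Iterable

    def deploy_incrementally(
        revision: str,
        machines: Iterable[str],
        push: Callable[[str, str], None],        # push(revision, machine)
        revert: Callable[[str], None],           # revert(machine) to the last good revision
        healthy: Callable[[str], bool],          # machine-level and cluster-level health check
        lock_deployments: Callable[[str], None], # halt the pipeline and alert the team
    ) -> bool:
        deployed = []
        for machine in machines:
            push(revision, machine)
            deployed.append(machine)
            if not healthy(machine):
                for m in deployed:               # fast revert of everything touched so far
                    revert(m)
                lock_deployments(f"revision {revision} failed health checks on {machine}")
                return False
        return True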
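
Finally, the dynamic alerting in the last bullet, stripped to its core, is "learn the normal range from recent history and page when the metric leaves it." The sketch below uses a rolling mean and standard deviation; the forecasting algorithm IMVU actually used isn't described here, so treat this as the simplest possible stand-in.

    # Minimal stand-in for dynamic alert thresholds: flag a metric whose latest
    # value falls outside bounds predicted from its recent history.
    from statistics import mean, stdev

    def out_of_bounds(history: list, latest: float, k: float = 3.0) -> bool:
        """Alert if `latest` is more than k standard deviations from the recent mean."""
        if len(history) < 2:
            return False                      # not enough data to set bounds yet
        mu, sigma = mean(history), stdev(history)
        return abs(latest - mu) > k * max(sigma, 1e-9)

    # Sign-ups per minute that suddenly drop to zero should page operations.
    recent = [120, 118, 125, 121, 119, 123, 122, 117, 124, 120]
    assert out_of_bounds(recent, 0.0)         # fires
    assert not out_of_bounds(recent, 121.0)   # within normal bounds

A production version would also account for time-of-day and day-of-week seasonality, which is exactly what makes static thresholds so hard to set for these metrics.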

So, if you had been able to sneak over to the desk of one of our engineers, log in to their machine, and secretly check in an infinite loop on some highly trafficked page, here's what would have happened. Somewhere between 10 and 20 minutes later, that engineer would have received an email with a message that read something like this:

Dear so-and-so,

Thank you so much for attempting to check in revision 1234. Unfortunately, that is a terrible idea, and your change has been reverted. We've also alerted the whole team to what's happened and look forward to you figuring out what went wrong.

Best of luck,

Your Software

(OK, that's not exactly what it said, but you get the idea.)

The goal of continuous deployment is to help development teams drive waste out of their process by simultaneously reducing the batch size (http://startuplessonslearned.com/2009/02/work-in-small-batches.html) and increasing the tempo of their work. This makes it possible for teams to get—and stay—in a condition of flow for sustained periods. This condition makes it much easier for teams to innovate, experiment, and achieve sustained productivity, and it nicely complements other continuous improvement systems, such as Five Whys, which we'll discuss later in this chapter.
