Getting Started

Continuous deployment is controversial. When most people first hear about continuous deployment, they think I'm advocating low-quality code (http://www.developsense.com/2009/03/50-deployments-day-and-perpetual-beta.html) or an undisciplined cowboy-coding development process (http://lastinfirstout.blogspot.com/2009/03/continuous-deployment-debate.html). On the contrary, I believe that continuous deployment requires tremendous discipline and can greatly enhance software quality by applying a rigorous set of standards to every change to prevent regressions, outages, or harm to key business metrics. Another common reaction I hear is that continuous deployment is too complicated, too time-consuming, or hard to prioritize. It's this latter fear that I'd like to address head-on in this chapter. Although it is true that the full system we use to support deploying 50 times a day at IMVU is elaborate, it certainly didn't start that way. By making a few simple investments and process changes, any development team can be on its way to continuous deployment. It's the journey, not the destination, that counts. Here's the why and how, in five steps.

Step 1: Continuous Integration Server

This is the backbone of continuous deployment. We need a centralized place where all automated tests (unit tests, functional tests, integration tests, everything) can be run and monitored upon every commit. Many fine, free software tools are available to make this easy—I have had success with Buildbot (http://buildbot.net). Whatever tool you use, it's important that it can run all the tests your organization writes, in all languages and frameworks.
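As a concrete sketch of that last point, the CI server's "run everything" job can be a thin wrapper that invokes each suite in turn and fails if any of them fails. The suite names and commands below are placeholders for whatever tests your organization actually has, in whatever languages they're written.

```python
#!/usr/bin/env python
"""Hypothetical "run everything" step for the continuous integration server.

Each entry is a test suite run as an ordinary command; the commands shown are
placeholders to be replaced with your organization's real suites.
"""
import subprocess
import sys

TEST_SUITES = [
    ("python unit tests", ["python", "-m", "pytest", "tests/unit"]),
    ("javascript tests", ["npm", "test"]),
    ("integration tests", ["./run_integration_tests.sh"]),
]

def main():
    failures = []
    for name, command in TEST_SUITES:
        print("=== running %s ===" % name)
        if subprocess.call(command) != 0:
            failures.append(name)
    if failures:
        print("FAILED: %s" % ", ".join(failures))
        return 1  # a nonzero exit marks the build red and halts the line
    print("all test suites passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```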

If you have only a few tests (or even none at all), don't despair. Simply set up the continuous integration server and agree to one simple rule: we'll add a new automated test every time we fix a bug. If you follow that rule, you'll immediately start to get test coverage where it's needed most: in the parts of your code that have the most bugs and therefore drive the most waste for your developers. Even better, these tests will start to pay immediate dividends by propping up the most unstable code and freeing up a lot of time that used to be devoted to finding and fixing regressions (a.k.a. firefighting).

If you already have a lot of tests, make sure the continuous integration server spends only a small amount of time on a full run: 10 to 30 minutes at most. If that's not possible, simply partition the tests across multiple machines until you get the run time down to something reasonable.
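If you do need to partition, one simple approach is to assign each test file to a machine with a stable hash, so every machine runs a disjoint subset and the shards together cover everything. This is only a sketch; the test layout and pytest invocation are assumptions.

```python
#!/usr/bin/env python
"""Sketch: split the test files across N CI machines using a stable hash.

Run the same script on every machine with a different --shard index; together
the shards cover each test file exactly once. Paths are assumptions.
"""
import argparse
import glob
import subprocess
import sys
import zlib

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--shard", type=int, required=True)       # this machine's index
    parser.add_argument("--num-shards", type=int, required=True)  # total machines
    args = parser.parse_args()

    all_tests = sorted(glob.glob("tests/**/test_*.py", recursive=True))
    # Stable assignment: the same file always lands on the same shard.
    mine = [t for t in all_tests
            if zlib.crc32(t.encode()) % args.num_shards == args.shard]

    print("shard %d of %d: running %d of %d test files"
          % (args.shard, args.num_shards, len(mine), len(all_tests)))
    return subprocess.call(["python", "-m", "pytest"] + mine)

if __name__ == "__main__":
    sys.exit(main())
```

Adding capacity is then just a matter of raising --num-shards and adding another machine.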

For more on the nuts and bolts of setting up continuous integration, see "Continuous integration step-by-step" (http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html).

Step 2: Source Control Commit Check

The next piece of infrastructure we need is a source control server with a commit-check script. I've seen this implemented with CVS (http://www.nongnu.org/cvs), Subversion, and Perforce, and I have no reason to believe it isn't possible in any source control system. The most important thing is that you have the opportunity to run custom code at the moment a new commit is submitted but before the server accepts it. Your script should have the power to reject a change and report a message back to the person attempting to check in. This is a very handy place to enforce coding standards, especially those of the mechanical variety.

But its role in continuous deployment is much more important. This is the place you can control what I like to call "the production line," to borrow a metaphor from manufacturing. When something is going wrong with our systems at any place along the line, this script should halt new commits. So, if the continuous integration server runs a build and even one test breaks, the commit script should prohibit new code from being added to the repository. In subsequent steps, we'll add additional rules that also "stop the line," and therefore halt new commits.
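Here's a minimal sketch of such a commit check as a Subversion pre-commit hook. The halt-flag file is an assumption: imagine the continuous integration server touches it when a test breaks and removes it once the build is green again; any shared signal the hook can read will do just as well.

```python
#!/usr/bin/env python
"""Sketch of a Subversion pre-commit hook that enforces the "production line".

Subversion invokes this hook with the repository path and the transaction id;
writing to stderr and exiting nonzero rejects the commit and shows the message
to the committer. The halt-flag file is a hypothetical mechanism.
"""
import os
import sys

HALT_FLAG = "/var/run/production-line-halted"  # hypothetical shared flag

def main():
    repos, txn = sys.argv[1], sys.argv[2]  # available for further checks via svnlook
    if os.path.exists(HALT_FLAG):
        reason = open(HALT_FLAG).read().strip() or "the build is broken"
        sys.stderr.write(
            "Commit rejected: the production line is halted (%s).\n"
            "Help fix the problem before adding new code.\n" % reason)
        return 1
    # Mechanical coding-standard checks could also go here.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```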

This sets up the first important feedback loop that you need for continuous deployment. Our goal as a team is to work as fast as we can reliably produce high-quality code—and no faster. Going any "faster" is actually just creating delayed waste that will slow us down later. (This feedback loop is also discussed in detail at http://startuplessonslearned.com/2008/12/continuous-integration-step-by-step.html.)

Step 3: Simple Deployment Script

At IMVU, we built a serious deployment script that incrementally deploys software machine by machine and monitors the health of the cluster and the business along the way so that it can do a fast revert if something looks amiss. We call it a cluster immune system (http://www.slideshare.net/olragon/just-in-time-scalability-agile-methods-to-support-massive-growth-presentation-presentation-925519). But we didn't start out that way. In fact, attempting to build a complex deployment system like that from scratch is a bad idea.

Instead, start simple. It's not even important that the process be automated at first, although as you practice you will automate more of it over time. Rather, it's important that you do every deployment the same way and have a clear, published process for how to do it that you can evolve over time.

For most websites, I recommend starting with a simple script that just rsyncs code to a version-specific directory on each target machine. If you are facile with Unix symlinks (http://www.mikerubel.org/computers/rsync_snapshots/), you can pretty easily set this up so that advancing to a new version (and hence, rolling back) is as easy as switching a single symlink on each server. But even if that's not appropriate for your setup, have a single script that does a deployment directly from source control.
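A first version of that script might look something like the sketch below. The host names, paths, and the svn export step are assumptions to adapt to your own setup; rolling back is just pointing the symlink at the previous version's directory.

```python
#!/usr/bin/env python
"""Sketch of a minimal deploy script: rsync code into a version-specific
directory on each server, then flip a `current` symlink to point at it.

Hosts, paths, and the `svn export` step are assumptions for illustration.
"""
import os
import subprocess
import sys

HOSTS = ["web1.example.com", "web2.example.com"]  # hypothetical servers
APP_ROOT = "/srv/app"
RELEASE_ROOT = APP_ROOT + "/releases"

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

def deploy(revision):
    local_dir = "build/r%s" % revision
    release_dir = "%s/r%s" % (RELEASE_ROOT, revision)
    os.makedirs("build", exist_ok=True)
    # Export a clean copy of the revision from source control.
    run(["svn", "export", "-r", revision,
         "http://svn.example.com/repo/trunk", local_dir])
    for host in HOSTS:
        # Copy the release into a version-specific directory...
        run(["rsync", "-az", local_dir + "/", "%s:%s/" % (host, release_dir)])
        # ...then switch the `current` symlink. (For a strictly atomic switch,
        # create a temporary symlink and rename it over `current`.)
        run(["ssh", host, "ln -sfn %s %s/current" % (release_dir, APP_ROOT)])
    print("deployed r%s to %d hosts" % (revision, len(HOSTS)))

if __name__ == "__main__":
    deploy(sys.argv[1])
```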

When you want to push new code to production, require that everyone use this one mechanism. Keep it manual, but simple, so that everyone knows how to use it. Most importantly, have it obey the same "production line" halting rules as the commit script. That is, make it impossible to do a deployment for a given revision if the continuous integration server hasn't yet run and had all tests pass for that revision.
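Wiring that rule into the deploy script can be as simple as a guard that compares the requested revision against the latest revision the continuous integration server has verified. The sketch below assumes, hypothetically, that the CI server publishes that revision number at a plain-text URL.

```python
#!/usr/bin/env python
"""Sketch of a "production line" guard for the deploy script.

Assumes (hypothetically) that the CI server publishes the most recent revision
for which every test passed as plain text at a known URL.
"""
from urllib.request import urlopen

CI_STATUS_URL = "http://ci.example.com/last-green-revision.txt"  # hypothetical

def assert_revision_is_green(revision):
    last_green = urlopen(CI_STATUS_URL, timeout=10).read().decode().strip()
    if int(revision) > int(last_green):
        raise SystemExit(
            "Refusing to deploy r%s: the CI server has only verified up to "
            "r%s. Wait for the build to pass (or fix it) and try again."
            % (revision, last_green))
```

Called at the top of the deploy function, a guard like this makes it impossible to push a revision the tests haven't yet blessed.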

Step 4: Real-Time Alerting

No matter how good your deployment process is, bugs can still get through. The most annoying are bugs that don't manifest until hours or days after the code that caused them is deployed. To catch those nasty bugs, you need a monitoring platform that can let you know when things have gone awry and get a human being involved in debugging them.

To start, I recommend a system such as the open source Nagios (http://www.nagios.org/). Out of the box, it can monitor basic system stats such as load average and disk utilization. For continuous deployment purposes, we also want it to monitor business metrics such as simultaneous users or revenue per unit time. At the beginning, simply pick one or two of these metrics to use. Anything is fine to start, and it's important not to choose too many. The goal should be to wire the Nagios alerts up to a pager, cell phone, or high-priority email list that will wake someone up in the middle of the night if one of these metrics goes out of bounds. If the pager goes off too often, it won't get the attention it deserves, so start simple.
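A business-metric monitor can be an ordinary Nagios plugin: a script that prints a one-line status and exits 0 for OK, 1 for WARNING, 2 for CRITICAL, or 3 for UNKNOWN. In the sketch below, the metrics endpoint and thresholds are assumptions; point it at whatever number your application can already report.

```python
#!/usr/bin/env python
"""Sketch of a Nagios plugin that watches a business metric.

Nagios plugins follow a simple convention: print a one-line status and exit
0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The endpoint and the
thresholds here are assumptions to be tuned to your own baseline.
"""
import sys
from urllib.request import urlopen

METRIC_URL = "http://internal.example.com/metrics/simultaneous_users"  # hypothetical
WARN_BELOW = 500
CRIT_BELOW = 200

def main():
    try:
        users = int(urlopen(METRIC_URL, timeout=10).read())
    except Exception as exc:
        print("SIMULTANEOUS USERS UNKNOWN - %s" % exc)
        return 3
    if users < CRIT_BELOW:
        print("SIMULTANEOUS USERS CRITICAL - only %d online" % users)
        return 2
    if users < WARN_BELOW:
        print("SIMULTANEOUS USERS WARNING - only %d online" % users)
        return 1
    print("SIMULTANEOUS USERS OK - %d online" % users)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```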

Follow this simple rule: every time the pager goes off, halt the production line (which will prevent check-ins and deployments). Fix the urgent problem, and don't resume the production line until you've had a chance to schedule a Five Whys meeting for root-cause analysis (RCA), which we'll discuss next.
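Halting the line doesn't require anything fancy: it can be as low-tech as setting the same flag the commit hook (and deploy script) already consult. A sketch, reusing the hypothetical flag file from Step 2:

```python
#!/usr/bin/env python
"""Sketch: halt or resume the production line by hand, or from an alert
handler. Reuses the hypothetical flag file the pre-commit hook checks.
"""
import os
import sys

HALT_FLAG = "/var/run/production-line-halted"  # same hypothetical path as Step 2

def main():
    command = sys.argv[1] if len(sys.argv) > 1 else ""
    if command == "halt":
        reason = " ".join(sys.argv[2:]) or "pager alert"
        with open(HALT_FLAG, "w") as flag:
            flag.write(reason + "\n")
        print("production line halted: %s" % reason)
    elif command == "resume":
        os.remove(HALT_FLAG)
        print("production line resumed")
    else:
        print("usage: production_line.py halt [reason] | resume")
        return 2
    return 0

if __name__ == "__main__":
    sys.exit(main())
```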

Step 5: Root-Cause Analysis (Five Whys)

So far, we've talked about making modest investments in tools and infrastructure and adding a couple of simple rules to our development process. Most teams should be able to do everything we've talked about in a week or two, at the most, because most of the work involves installing and configuring off-the-shelf software.

Five Whys gets its name from the process of asking "why" recursively to uncover the true source of a given problem. Five Whys enables continuous deployment when you add this rule: every time you do an RCA, make a proportional investment in prevention at each of the five levels you uncover. Proportional means the solution shouldn't be more expensive than the problem you're analyzing; a minor inconvenience for only a few customers should merit a much smaller investment than a multihour outage.

But no matter how small the problem is, always make some investments, and always make them at each level. Because our focus in this chapter is deployment, this means always asking the question, "Why was this problem not caught earlier in our deployment pipeline?" So, if a customer experienced a bug, why didn't Nagios alert us? Why didn't our deployment process catch it? Why didn't our continuous integration server catch it? For each question, make a small improvement.

Over months and years, these small improvements add up, much like compounding interest. But there is a reason this approach is superior to making a large upfront investment in a complex continuous deployment system modeled on IMVU's (or anyone else's). The payoff is that the system you end up with will be uniquely adapted to your particular circumstances. If most of your headaches come from performance problems in production, you'll naturally be forced to invest in prevention at the deployment/alerting stage. If your problems stem from badly factored code, which causes collateral damage for even small features or fixes, you'll naturally find yourself adding a lot of automated tests to your continuous integration server. Each problem drives investments in that category of solution. Thankfully, there's an 80/20 rule at work: 20% of your code and architecture probably drives 80% of your headaches. Investing in that 20% frees up incredible time and energy that can be invested in more productive things.

Following these five steps will not give you continuous deployment overnight. In its initial stages, most of your RCAs will come back to the same problem: "We haven't invested in preventing that yet." But with patience and hard work, anyone can use these techniques to inexorably drive waste out of their development process.
