On-call Developers

One of the most common misconceptions about separating operations from development is that only members of the operations team should work on fixing production issues. This setup avoids the development interruptions that delay new features, but it insulates developers from the realities of code running in a 24/7 production environment.

So far in this chapter we've discussed the need for operations to trust that developers have their best interests in mind, that developers should be allowed to deploy code without formal approval processes, and that developers should have access to production systems. This works only if developers take on responsibility for fixing problems with their code in production, which means they need to be on call.

At Flickr, the operations team runs a standard primary/backup pager rotation. In almost all cases, a page corresponds to a known failure scenario, and the on-call engineer can resolve the issue with a well-understood playbook. But every so often there's an issue the on-call engineer can't debug or fix alone. In these cases, the engineer calls one of the developers for help.

This environment creates some useful social dynamics. As anyone who has carried a pager knows, avoiding being paged at 3:00 a.m. on a Sunday is a powerful incentive to keep things from breaking and to make sure ops has all the tools it needs to fix a problem without you. As an example, Flickr has a shared IRC channel; it's very common to see something ...
