So, What's the Problem?

The sys admin subculture almost prides itself on firefighting and on its reliance on secret knowledge. Several years ago, I would prepare for deployments at an e-commerce start-up I was working for by picking up an energy drink on the way into the office for the late-night ceremony. (I was partial to the 24-ounce Rockstar Juiced Pomegranate at the time and have since moved on to the 180 Açai Berry, but I digress.) The ritual began with reciting incantations and starting some scripts. Files were moved, processes were restarted, and sometimes schemas were changed or new systems were added. The process was an unpredictable mix of running scripts, babysitting the results, tailing logs, and watching the monitoring. "Operations" wasn't part of my title or official duties, but I knew my alternative was to go to bed and chance an outage. If things went well, I never had to pop the top on my cylinder of sugar and caffeine, but my desk was decorated with a platoon of empty cans.

The production infrastructure was roughly two dozen eight-core machines with all the RAM they could handle, smallish by web-scale standards but complex enough to manifest pathology. A major source of problems was inconsistent configuration between machines and between environments, which meant the same code might behave or perform differently from one machine to the next. Another was the inconsistent deployment process, which introduced the opportunity for human error with a lot of manual ...
