From Apprentice to Master

When you allow yourself to meditate on a question, the answer most often is simple and rather unoriginal. It turns out that being a master web operations engineer is no different from being a master carpenter or a master teacher. The effort to master any given discipline requires four basic pursuits: knowledge, tools, experience, and discipline.

Knowledge

Knowledge is a uniquely simple subject on the Internet. The Internet acts as a very effective knowledge-retention system. The common answer to many questions, "Let me Google that for you," is an amazingly effective and high-yield answer. Almost everything you want to know (and have no desire to know) about operating web infrastructure is, you guessed it, on the Web.

Limiting yourself to the Web for information is, well, limiting. You are not alone in this adventure, despite the feeling. You have peers, and they need you as much as you need them. User groups (of a startling variety) exist around the globe and are an excellent place to share knowledge.

If you are reading this, you already understand the value of knowledge through books. A healthy bookshelf is something all master web operations engineers have in common. Try to start a book club in your organization, or if your organization is too small, ask around at a local user group.

One unique aspect of the Internet industry is that almost nothing is secret. In fact, very little is even proprietary and, quite uniquely, almost all specifications are free. How does the Internet work? Switching: there is an IEEE specification for that. IP: there is RFC 791 for that. TCP: RFC 793. HTTP: RFC 2616. They are all there for the reading and provide a much deeper foundational base of understanding. These protocols are the rules by which you provide services, and the better you understand them, the more educated your decisions will be. But don't stop there! TCP might be described in RFC 793, but all sorts of TCP details and extensions and "evolution" are described in related RFCs such as 1323, 2001, 2018, and 2581. Perhaps it's also worthwhile to understand where TCP came from: RFC 761.

To revisit the theory and practice conundrum, the RFC for TCP is the theory; the kernel code that implements the TCP stack in each operating system is the practice. The glorious collision of theory and practice lies in the nuances of interoperability (or inter-inoperability) among the different TCP implementations, and the explosions are slow download speeds, hung sessions, and frustrated users.

On your path from apprentice to master, it is your job to retain as much information as possible so that the curiously powerful coil of jello between your ears can sort, filter, and correlate all that trivia into a concise and accurate picture used to power decisions: both the long-term critical decisions of architecture design and the momentary critical decisions of fault remediation.

Tools

Tools, in my experience, are one of the most incessantly and emphatically argued topics in computing: vi versus Emacs, Subversion versus Git, Java versus PHP—beginning as arguments from different camps but rapidly evolving into nonsensical religious wars.

The simple truth is that people are successful with these tools despite their pros and cons. Why do people use all these different tools, and why do we keep making more? I think Thomas Carlyle and Benjamin Franklin noted something important about our nature as humans when they said "man is a tool-using animal" and "man is a tool-making animal," respectively. Because it is in our nature to build and use tools, why must we argue fruitlessly about their merits? Although Thoreau meant something equally poignant, I feel his commentary that "men have become the tools of their tools" is equally accurate in the context of modern vernacular.

The simple truth is articulated best by Emerson: "All the tools and engines on Earth are only extensions of man's limbs and senses." This articulates well the ancient sentiment that a tool does not the master craftsman make. In the context of Internet applications, you can see this in the wide variety of languages, platforms, and technologies that are glued together successfully. It isn't Java or PHP that makes an architecture successful, it is the engineers that design and implement it—the craftsmen.

One truth about engineering is that knowing your tools, regardless of the tools that are used, is a prerequisite to mastering the trade. Your tools must become extensions of your limbs and senses. It should be quite obvious to engineers and nonengineers alike that reading the documentation for a tool during a crisis is not the best use of one's time. Knowing your tools goes above and beyond mere competency; you must know the effects they produce and how they interact with your environment—you must be practiced.

A great tool in any operations engineer's tool chest is a system call tracer. They vary (slightly) from system to system. Solaris has truss, Linux has strace, FreeBSD has ktrace, and Mac OS X had ktrace but displaced that with the less useful dtruss. A system call tracer is a peephole into the interaction between user space and kernel space; in other words, if you aren't computationally bound, this tool tells you what exactly your application is asking for and how long it takes to be satisfied.

DTrace is a uniquely positioned tool available on Solaris, OpenSolaris, FreeBSD, Mac OS X, and a few other platforms. This isn't really a chapter on tools, but DTrace certainly deserves a mention. DTrace is a huge leap forward in system observability and allows the craftsman to understand his system like never before; however, DTrace is an oracle in both its perspicacity and the fact that the quality of its answers is coupled tightly with the quality of the question asked of it. System call tracers, on the other hand, are a proverbial avalanche—easy to induce and challenging to navigate.

Why are we talking about avalanches and oracles? It is an aptly mixed metaphor for the amorphous and heterogeneous architectures that power the Web. Using strace to inspect what your web server is doing can be quite enlightening (and often results in some easily won optimizations the first few times). Looking at the output for the first time when something has gone wrong provides basically no value except to the most skilled engineers; in fact, it can often cost you. The issue is that this is an experiment, and you have no control. When something is "wrong" it would be logical to look at the output from such a tool in an attempt to recognize an unfamiliar pattern. It should be quite clear that if you have failed to use the tool under normal operating conditions, you have no basis for comparison, and all patterns are unfamiliar. In fact, it is often the case that patterns that appear to be correlated to the problem are not, and much time is wasted pursuing red herrings.
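The point about needing a basis for comparison can be made concrete with a small sketch. It assumes you have saved two summaries produced by strace's counting mode (for example, strace -c -p <pid> run for the same length of time under normal load and again during an incident), and that the summary uses the usual column order with the call count in the fourth column. The sample text, the function names, and the 2x threshold are all hypothetical, chosen only for illustration; this is not a substitute for reading the full trace.

```python
"""Compare two `strace -c` summaries: a baseline and an incident capture.

Assumption: each data row looks like
    % time  seconds  usecs/call  calls  [errors]  syscall
which matches the common `strace -c` layout; adjust if yours differs.
"""

# Hypothetical sample captures, invented for this example.
BASELINE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.00    0.006000          10       600           read
 30.00    0.003000          10       300           write
 10.00    0.001000          10       100         5 open
------ ----------- ----------- --------- --------- ----------------
100.00    0.010000                  1000         5 total
"""

INCIDENT = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.00    0.020000          10      2000           poll
 30.00    0.012000          19       620           read
 15.00    0.006000          20       300           write
  5.00    0.002000          10       200       150 open
------ ----------- ----------- --------- --------- ----------------
100.00    0.040000                  3120       150 total
"""


def parse_strace_summary(text):
    """Return {syscall_name: call_count} parsed from summary text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # blank line, separator, or short total row
        try:
            float(fields[0])        # data rows start with a percentage
            calls = int(fields[3])  # fourth column is the call count
        except ValueError:
            continue  # header or separator row
        name = fields[-1]
        if name == "total":
            continue  # skip the summary row
        counts[name] = calls
    return counts


def compare(baseline, incident, ratio=2.0):
    """List syscalls that are new, or whose count grew by at least `ratio`."""
    findings = []
    for name, calls in sorted(incident.items()):
        before = baseline.get(name, 0)
        if before == 0:
            findings.append((name, before, calls, "new"))
        elif calls / before >= ratio:
            findings.append((name, before, calls, "increased"))
    return findings


if __name__ == "__main__":
    base = parse_strace_summary(BASELINE)
    now = parse_strace_summary(INCIDENT)
    for name, before, after, kind in compare(base, now):
        print(f"{name}: {before} -> {after} ({kind})")
```

Without the baseline capture, every line of the incident summary would look equally suspicious; with it, only the calls that actually changed stand out. Even then, a changed count is a lead to investigate, not proof of cause.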

Defusing the tools argument is important. You should strive to choose a tool based on its appropriateness for the problem at hand rather than to indulge your personal preference. An excellent case in point is the absolutely superb release management of the FreeBSD project over its lifetime using what is now considered by most to be a completely antiquated version control system (CVS). Many successful architectures have been built atop the PHP language, which lacks many of the features of common modern languages. On the flip side, many projects fail even when equipped with the most robust and capable tools. The quality of the tool itself is always far less important than the adroitness with which it is wielded. That being said, a master craftsman should always select an appropriate, high-quality tool for the task at hand.

Experience

Experience is one of the most powerful weapons in any situation. It is so important because it means so many things. Experience is, in its very essence, making good judgments, and it is gained by making bad ones. Watching theory and practice collide is both scary and beautiful. The collision inevitably has casualties—lost data, unavailable services, angered users, and lost money—but at the same time its full context and pathology have profound beauty. Assumptions have been challenged (and you have lost) and unexpected outcomes have manifested, and above all else, you have the elusive opportunity to be a pathologist and gain a deeper understanding of a new place in your universe where theory and practice bifurcate.

Experience and knowledge are quite interrelated. Knowledge can be considered the studying of experiences of others. You have the information but have not grasped the deeper meaning that is gained by directly experiencing the causality. That deeper meaning allows you to apply the lesson learned in other situations where your experience-honed insight perceives correlations—an insight that often escapes those with knowledge alone.

Experience is both a noun and a verb: gaining it is as easy (and as hard) as doing it.

The organizational challenge of inexperience

Although gaining experience is as easy as simply "doing," in the case of web operations it is the process of making and surviving bad judgments. The question is: how can an organization that is competing in such an aggressive industry afford to have its staff members make bad judgments? Having and executing on an answer to this question is fundamental to any company that wants to house career-oriented web operations engineers. There are two parts to this answer, a yin and yang if you will.

The first is to make it safe for junior and mid-level engineers to make bad judgments. You accomplish this by limiting liability and injury from individual judgments. The environment (workplace, network, systems, and code) must be able to survive a bad judgment now and again. You never want to be forced into the position of firing an individual because of a single instance of bad judgment (although I realize this cannot be entirely prevented, it is a good goal). The larger the mistake, the more profound the opportunity to extract deep and lasting value from the lesson. This leads us to the second part of the answer.

Never allow the same bad judgment twice. Mistakes happen. Bad judgments will occur as a matter of fact. Not learning from one's mistakes is inexcusable. Although exceptions always exist, you should expect and promote a culture of zero tolerance for repetitious bad judgment.

The concept of "senior operations"

One thing that has bothered me for quite some time and continues to bother me is job applications from junior operations engineers for senior positions. Their presumption is that knowledge dictates hierarchical position within a team; just as in other disciplines, this is flat-out wrong. The single biggest characteristic of a senior engineer is consistent and solid good judgment. This obviously requires exposure to situations where judgment is required and is simple math: the rate of difficult situations requiring judgment multiplied by tenure. It is possible to be on a "fast track" by landing an operations position in which disasters strike at every possible moment. It is also possible to spend 10 years in a position with no challenging decisions and, as a result, accumulate no valuable experience.

Generation X (and even more so, Generation Y) are cultures of immediate gratification. I've worked with a staggering number of engineers who expect their "career path" to take them to the highest ranks of the engineering group inside five years just because they are smart. This is simply impossible in the staggering numbers I've witnessed. Not everyone can be senior. If, after five years, you are senior, are you at the peak of your game? After five more years will you not have accrued more invaluable experience? What then: "super engineer"? What about five years later: "super-duper engineer"? I blame the youth of our discipline for this affliction. The truth is that very few engineers have been in the field of web operations for 15 years. Given the dynamics of our industry, many elected to move on to managerial positions or risk an entrepreneurial run at things.

I have some advice for individuals entering this field with little experience: be patient. However, this adage is typically paradoxical, as your patience very well may run out before you comprehend it.

Discipline

Discipline, in my opinion, is the single biggest disaster in our industry. Web operations has an atrocious track record when it comes to structure, process, and discipline. As a part of my job, I do a lot of assessments. I go into companies and review their organizational structure, operational practices, and overall architecture to identify when and where they will break down as business operations scale up.

Can you guess what I see more often than not? I see lazy cowboys and gunslingers; it's the Wild, Wild West. Laziness is often touted as a desired quality in a programmer. In the Perl community, where this became part of the mantra, the meaning was tongue-in-cheek (further exemplified by the use of the word hubris in the same mantra). What is meant is that by doing things as correctly and efficiently as possible you end up doing as little work as possible to solve a particular problem—this is actually quite far from laziness. Unfortunately, others in the programming and operations fields have taken actual laziness as a point of pride to which I say, "not in my house."

Discipline is controlled behavior resulting from training, study, and practice. In my experience, discipline is the ingredient most commonly left out of a web operations team, and its absence results in inconsistency and nonperformance.

Discipline is not something that can be taught via a book; it is something that must be learned through practice. Each task you undertake should be approached from the perspective of a resident. Treating your position and responsibilities as long term and approaching problems to develop solutions that you will be satisfied with five years down the road is a good basis for the practice that results in discipline.

I find it ironic that software engineering (a closely related field) has a rather good track record of discipline. I conjecture that the underlying reason for a lack of discipline within the field of web operations is the lack of a career path itself. Although it may seem like a chicken-and-egg problem, I have overwhelming confidence that we are close to rewarding our field with an understood career path.

It is important for engineers who work in the field now to participate in sculpting what a career in operations looks like. The Web is here to stay, and services thereon are becoming increasingly critical. Web operations "the career" is inevitable. By participating, you can help to ensure that the aspect of your job that seduced you in the first place carries through into your career.


Web Operations by John Allspaw, Jesse Robbins