Thinking About System Administration

I’ve touched briefly on some of the nontechnical aspects of system administration. These dynamics will probably not be an issue if it really is just you and your PC, but if you interact with other people at all, you’ll encounter these issues. It’s a cliché that system administration is a thankless job—one widely-reprinted cartoon has a user saying “I’d thank you but system administration is a thankless job”—but things are actually more complicated than that. As another cliché puts it, system administration is like keeping the trains on time; no one notices except when they’re late.

System management often seems to involve a tension between authority and responsibility on the one hand and service and cooperation on the other. The extremes seem easier to maintain than any middle ground; fascistic dictators who rule “their system” with an iron hand, unhindered by the needs of users, find their opposite in the harried system managers who jump from one user request to the next, in continual interrupt mode. The trick is to find a balance between being accessible to users and their needs—and sometimes even to their mere wants—while still maintaining your authority and sticking to the policies you’ve put in place for the overall system welfare. For me, the goal of effective system administration is to provide an environment where users can get done what they need to, in as easy and efficient a manner as possible, given the demands of security, other users’ needs, the inherent capabilities of the system, and the realities and constraints of the human community in which they all are located.

To put it more concretely, the key to successful, productive system administration is knowing when to solve a CPU-overuse problem with a command like:[1]

# kill -9 `ps aux | awk '$1=="chavez" {print $2}'`

(This command blows away all of user chavez’s processes.) It’s also knowing when to use:

$ write chavez 
You've got a lot of identical processes running on dalton. 
Any problem I can help with? 
^D

and when to walk over to her desk and talk with her face-to-face. The first approach displays Unix finesse as well as administrative brute force, and both tactics are certainly appropriate, even vital, at times. At other times, a simpler, less aggressive approach will work better to resolve your system’s performance problems in addition to the user’s confusion. It’s also important to remember that there are some problems no Unix command can address.

To a great extent, successful system administration is a combination of careful planning and habit, however much it may seem like crisis intervention at times. The key to handling a crisis well lies in having had the foresight and taken the time to anticipate and plan for the type of emergency that has just come up. As long as it only happens once in a great while, snatching victory from the jaws of defeat can be very satisfying and even exhilarating.

On the other hand, many crises can be prevented altogether by a determined devotion to carrying out all the careful procedures you’ve designed: changing the root password regularly, faithfully making backups (no matter how tedious), closely monitoring system logs, logging out and clearing the terminal screen as a ritual, testing every change several times before letting it loose, sticking to policies you’ve set for users’ benefit—whatever you need to do for your system. (Emerson said, “A foolish consistency is the hobgoblin of little minds,” but not a wise one.)

My philosophy of system administration boils down to a few basic strategies that can be applied to virtually any of its component tasks:

  • Know how things work. In these days, when operating systems are marketed as requiring little or no system administration, and the omnipresent simple-to-use tools attempt to make system administration simple for an uninformed novice, someone has to understand the nuances and details of how things really work. It should be you.

  • Plan it before you do it.

  • Make it reversible (backups help a lot with this one).

  • Make changes incrementally.

  • Test, test, test, before you unleash it on the world.

I learned about the importance of reversibility from a friend who worked in a museum putting together ancient pottery fragments. The museum followed this practice so that if better reconstructive techniques were developed in the future, they could undo the current work and use the better method. As far as possible, I’ve tried to do the same with computers, adding changes gradually and preserving a path by which to back out of them.

A simple example of this sort of attitude in action concerns editing system configuration files. Unix systems rely on many configuration files, and every major subsystem has its own files (all of which we’ll get to). Many of these will need to be modified from time to time.

I never modify the original copy of the configuration file, either as delivered with the system or as I found it when I took over the system. Rather, I always make a copy of these files the first time I change them, appending the suffix .dist to the filename; for example:

# cd /etc
# cp inittab inittab.dist
# chmod a-w inittab.dist

I write-protect the .dist file so I’ll always have it to refer to. On systems that support it, use the cp command’s -p option to replicate the file’s current modification time in the copy.

I also make a copy of the current configuration file before changing it in any way so undesirable changes can be easily undone. I add a suffix like .old or .sav to the filename for these copies. At the same time, I formulate a plan (at least in my head) about how I would recover from the worst consequence I can envision of an unsuccessful change (e.g., I’ll boot to single-user mode and copy the old version back).
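The two kinds of copies described above can be combined into one small shell function. This is only a minimal sketch, assuming a POSIX shell and a cp that supports -p; the function name backup_config is hypothetical, and the choice of .sav over .old is arbitrary.

```shell
# Hypothetical helper (not a standard command): snapshot a configuration
# file before editing it. Assumes cp supports the -p option.
backup_config() {
    file=$1

    # Refuse to continue if the file to be protected does not exist.
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }

    # First change ever: preserve the version as delivered (or as found)
    # and write-protect it so it is always available for reference.
    if [ ! -e "$file.dist" ]; then
        cp -p "$file" "$file.dist" && chmod a-w "$file.dist" || return 1
    fi

    # Every change: save the current version so it can be copied back.
    cp -p "$file" "$file.sav"
}
```

Running backup_config /etc/inittab before each edit would leave inittab.dist untouched after the first call and refresh inittab.sav every time. Backing out an unsuccessful change is then a matter of copying the .sav file over the edited one.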

Once I’ve made the necessary changes (or the first major change, when several are needed), I test the new version of the file, in a safe (nonproduction) environment if possible. Of course, testing doesn’t always find every bug or prevent every problem, but it eliminates the most obvious ones. Making only one major change at a time also makes testing easier.

Note

Some administrators use a revision control system (e.g., CVS or RCS) to track the changes to important system configuration files. Such packages are designed to track and manage changes to application source code by multiple programmers, but they can also be used to record changes to configuration files. Using a revision control system allows you to record the author and reason for any particular change, as well as reconstruct any previous version of a file at any time.

The remaining sections of this chapter discuss some important administrative tools. The first describes how to become the superuser (the Unix privileged account). Because I believe a good system manager needs to have both technical expertise and an awareness of and sensitivity to the user community of which he’s a part, this first chapter includes a section on Unix communication commands. The goal of these discussions—as well as of this book as a whole—is to highlight how a system manager thinks about system tasks and problems, rather than merely to provide literal, cookbook solutions for common scenarios.

Important administrative tools of other kinds are covered in later chapters of this book.



[1] On HP-UX systems, the corresponding ps command is ps -ef. Solaris systems can run either form depending on which version of ps comes first in the search path. AIX and Linux can emulate both versions, depending on whether a hyphen is used with options (System V style) or not (BSD style).
