Regular expressions are notations for describing patterns of text and, in effect, make up a special-purpose language for pattern matching. Although there are myriad variants, all share the idea that most characters in a pattern match literal occurrences of themselves, but some metacharacters have special meaning, such as * to indicate some kind of repetition or […] to mean any one character from the set within the brackets.
In practice, most searches in programs such as text editors are for
literal words, so the regular expressions are often literal strings like
paper anywhere. In so-called wildcards used
to specify filenames in Unix and Windows, a * matches any number of
characters, so the pattern
all filenames that end in
.c. There are
many, many variants of regular expressions, even in contexts where one
would expect them to be the same. Jeffrey Friedl's Mastering
Regular Expressions (O'Reilly) is an exhaustive study of the
Stephen Kleene invented regular expressions in the mid-1950s as a notation for finite automata; in fact, they are equivalent to finite automata in what they represent. They first appeared in a program setting in Ken Thompson's version of the QED text editor in the mid-1960s. In 1967, Thompson applied for a patent on a mechanism for rapid text matching based on regular expressions. The patent was granted in 1971, one of the very first software patents [U.S. Patent 3,568,156, Text Matching Algorithm, March 2, 1971].
Regular expressions moved from QED to the Unix editor ed, and then to the quintessential Unix tool grep, which Thompson created by performing radical surgery on ed. These widely used programs helped regular expressions become familiar throughout the early Unix community.
Thompson's original matcher was very fast because it combined two independent ideas. One was to generate machine instructions on the fly during matching so that it ran at machine speed rather than by interpretation. The other was to carry forward all possible matches at each stage, so it did not have to backtrack to look for alternative potential matches. In later text editors that Thompson wrote, such as ed, the matching code used a simpler algorithm that backtracked when necessary. In theory, this is slower, but the patterns found in practice rarely involved backtracking, so the ed and grep algorithm and code were good enough for most purposes.
Subsequent regular expression matchers like egrep and fgrep added richer classes of regular expressions, and focused on fast execution no matter what the pattern. Ever-fancier regular expressions became popular and were included not only in C-based libraries, but also as part of the syntax of scripting languages such as Awk and Perl.
In 1998, Rob Pike and I were writing The Practice of
Programming (Addison-Wesley). The last chapter of the book,
"Notation," collected a number of examples where good
notation led to better programs and better programming.
This included the use of simple data specifications (
printf, for instance), and the generation of
code from tables.
Because of our Unix backgrounds and nearly 30 years of experience with tools based on regular expression notation, we naturally wanted to include a discussion of regular expressions, and it seemed mandatory to include an implementation as well. Given our emphasis on tools, it also seemed best to focus on the class of regular expressions found in grep—rather than, say, those from shell wildcards—since we could also then talk about the design of grep itself.
The problem was that any existing regular expression package was far too big. The local grep was over 500 lines long (about 10 book pages) and encrusted with barnacles. Open source regular expression packages tended to be huge—roughly the size of the entire book—because they were engineered for generality, flexibility, and speed; none were remotely suitable for pedagogy.
I suggested to Rob that we find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and nontrivial class of patterns. Ideally, the code would fit on a single page.
Rob disappeared into his office. As I remember it now, he emerged in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of The Practice of Programming. That code implements a regular expression matcher that handles the following constructs.
Matches any literal character c.
Matches any single character.
Matches the beginning of the input string.
Matches zero or more occurrences of the previous character.
This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step toward creating a beautiful program. Rob deserves great credit for choosing a very small yet important, well-defined, and extensible set of features from among a wide set of options.
Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of good notation in making a program easier to use (and perhaps easier to write as well), the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.