The Original Unix Spellchecking Prototype

Spellchecking has been the subject of more than 300 research papers and books.[1] In his book Programming Pearls,[2] Jon Bentley reported that Steve Johnson wrote the first version of spell in an afternoon in 1975. Bentley then sketched a reconstruction of that program, credited to Kernighan and Plauger,[3] as a Unix pipeline that we can rephrase in modern terms like this:

prepare filename |                     Remove formatting commands
  tr A-Z a-z |                         Map uppercase to lowercase
    tr -c a-z '\n' |                   Remove punctuation
      sort |                           Put words in alphabetical order
        uniq |                         Remove duplicate words
          comm -13 dictionary -        Report words not in dictionary

Here, prepare is a filter that strips whatever document markup is present; in the simplest case, it is just cat. We assume the argument syntax for the GNU version of the tr command.
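For the simplest case, the pipeline can be wrapped in a small shell function. This is only a sketch under stated assumptions: the function name spell_sketch and its two arguments are invented for this example, cat stands in for prepare, and the dictionary is assumed to hold one lowercase word per line, sorted in the same locale that sort uses.

```shell
# Hypothetical wrapper (spell_sketch is not part of the original text).
# usage: spell_sketch dictionary document
spell_sketch() {
    dictionary=$1    # assumed: sorted, one lowercase word per line
    document=$2      # assumed: plain text, so cat can play the role of prepare

    cat "$document" |                  # Nothing to strip in plain text
      tr A-Z a-z |                     # Map uppercase to lowercase
        tr -c a-z '\n' |               # Turn every non-letter into a newline
          sort |                       # Put words in alphabetical order
            uniq |                     # Remove duplicate words
              comm -13 "$dictionary" - # Report words not in dictionary
}
```

It might be invoked as spell_sketch /path/to/wordlist report.txt, where both paths are placeholders. Because tr -c replaces each non-letter with a newline, adjacent punctuation and spaces produce empty lines, so the report may begin with a blank line.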

The only program in this pipeline that we have not seen before is comm: it compares two sorted files and selects, or rejects, lines common to both. Here, with the -13 option, it outputs only lines from the second file (the piped input) that are not in the first file (the dictionary). That output is the spelling-exception report.
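The effect of -13 is easy to see on two tiny sorted files. The following sketch uses made-up word lists and temporary file names chosen purely for illustration:

```shell
# Illustrative data only: both lists are already in sorted order.
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry  > "$tmp/dictionary"
printf '%s\n' apple bananna cherry > "$tmp/words"

# -1 suppresses lines found only in the first file (the dictionary),
# -3 suppresses lines common to both, leaving lines unique to the second file.
exceptions=$(comm -13 "$tmp/dictionary" "$tmp/words")
printf '%s\n' "$exceptions"

rm -rf "$tmp"
```

Here the only line printed is bananna: apple and cherry appear in both files, and banana appears only in the dictionary.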
