The Original Unix Spellchecking Prototype
Spellchecking has been the subject of more than 300 research papers and books.[1] In his book Programming Pearls,[2] Jon Bentley reported that Steve Johnson wrote the first version of spell in an afternoon in 1975. Bentley then sketched a reconstruction of that program, credited to Kernighan and Plauger,[3] as a Unix pipeline that we can rephrase in modern terms like this:
    prepare filename |             # Remove formatting commands
      tr A-Z a-z |                 # Map uppercase to lowercase
      tr -c a-z '\n' |             # Remove punctuation
      sort |                       # Put words in alphabetical order
      uniq |                       # Remove duplicate words
      comm -13 dictionary -        # Report words not in dictionary
Here, prepare is a filter that strips whatever document markup is present; in the simplest case, it is just cat. We assume the argument syntax for the GNU version of the tr command.
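To make the word-splitting stages concrete, here is a minimal sketch that runs the first five stages of the pipeline on one line of text. The sample sentence is invented for illustration, printf stands in for prepare, and LC_ALL=C is set only so that the sort order is predictable.

```shell
# Sketch only: the input sentence is made up for this illustration.
export LC_ALL=C    # fix the collation order so sort behaves the same everywhere

# printf plays the role of prepare (the simplest case, like cat).
words=$(printf 'Hello, World! Hello again.\n' |
  tr A-Z a-z |           # map uppercase to lowercase
  tr -c a-z '\n' |       # turn every non-letter character into a newline
  sort |                 # put words in alphabetical order
  uniq)                  # remove duplicate words

# Each non-letter becomes its own newline, so runs such as ", " leave
# empty lines behind; after sort and uniq, one empty line remains first.
printf '%s\n' "$words"
```

The output is an empty line followed by again, hello, and world. The empty line is a side effect of replacing each punctuation character and space separately.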
The only program in this pipeline that we have not seen before is comm: it compares two sorted files and selects, or rejects, lines common to both. Here, with the -13 option, it outputs only lines from the second file (the piped input) that are not in the first file (the dictionary). That output is the spelling-exception report.
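The effect of -13 is easy to check on a tiny example. The following sketch uses a three-word dictionary and a short word list, both invented for illustration; the temporary directory and the variable names are likewise assumptions, not part of the original program.

```shell
# Sketch only: the dictionary and the word list are made-up sample data.
export LC_ALL=C                      # sort and comm must agree on the ordering
tmpdir=$(mktemp -d)

# A tiny, already sorted dictionary.
printf '%s\n' apple banana cherry > "$tmpdir/dictionary"

# -1 suppresses lines found only in the first file (the dictionary),
# -3 suppresses lines common to both, so only lines unique to the
# second file (standard input, written as -) are printed.
exceptions=$(printf '%s\n' cherry zebra apple bananna |
  sort |
  comm -13 "$tmpdir/dictionary" -)

printf '%s\n' "$exceptions"
rm -rf "$tmpdir"
```

This prints bananna and zebra, the two words that do not appear in the sample dictionary. Both inputs must be sorted in the same collation order, which is why the word list passes through sort first.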