Word Lists

From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books listed in Chapter 16. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs,[7] each of which took several hours to write. Bentley's original specification was imprecise, so Hanson rephrased it this way: "Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency."

In the first of Bentley's articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth's program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy's is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.

McIlroy's program ...
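A minimal sketch of such a six-stage pipeline, written as a small shell script, might look like the following. The exact commands, options, and the default report length of 10 are illustrative assumptions, not McIlroy's verbatim text, and presume GNU or BSD versions of the standard tools:

    #! /bin/sh
    # Sketch: print the n (default 10) most frequent words on standard input,
    # with their counts, sorted by descending count.  The first two tr stages
    # are the only place where the notion of a "word" is defined.
    tr -cs 'A-Za-z' '\n' |    # replace each run of non-letters with a newline: one word per line
      tr 'A-Z' 'a-z' |        # fold uppercase into lowercase
      sort |                  # bring identical words together
      uniq -c |               # count each distinct word
      sort -rn |              # order by count, largest first
      sed "${1:-10}q"         # print the first n lines, then quit

Saved under a name such as wordfreq and run as sh wordfreq 25 < somefile.txt, it prints the 25 most frequent words in the file with their counts. Ties in the count are broken arbitrarily by sort -rn; adding a secondary alphabetical key (for example, sort -k1,1nr -k2) is a common refinement.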
