Prologue

Term Frequency

LIKE QUENEAU'S STORY, the computational task in this book is trivial: given a text file, we want to display the N (e.g., 25) most frequent words and their corresponding frequencies, ordered by decreasing frequency. We should make sure to normalize for capitalization and to ignore stop words like "the", "for", etc. To keep things simple, we don't care about the ordering of words that have equal frequencies. This computational task is known as term frequency.

Here is an example of an input file and corresponding output after computing the term frequency:

Input:
 White tigers live mostly in India
 Wild lions live mostly in Africa
Output:
 live - 2
 mostly - 2
 africa - 1
 india - 1
 lions - 1
 tigers - 1
 white - 1
 wild - 1
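
As a rough illustration of the task described above, here is a minimal Python sketch. It is only one of many possible ways to compute term frequency, not a reference solution. The function name term_frequency, the tiny STOP_WORDS set, and the alphabetical tie-breaking are assumptions made for this example; the task itself leaves the order of equal-frequency words unspecified.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real run would load a fuller one.
STOP_WORDS = {"the", "for", "in", "a", "an", "and", "of", "to"}


def term_frequency(text, n=25, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Lowercase to normalize capitalization, then extract runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    # Sort by decreasing count. Ties are broken alphabetically, which is
    # a choice made here only to keep the output deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]


text = "White tigers live mostly in India\nWild lions live mostly in Africa"
for word, count in term_frequency(text):
    print(f"{word} - {count}")
```

Passing a smaller n, for example term_frequency(text, n=2), keeps only the top entries, here "live" and "mostly".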
