Applying Zipf's law to text

Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the list of tokens sorted by descending frequency. The most frequent token has rank 1, and a token at rank r occurs roughly 1/r times as often as the top-ranked one. This law describes how tokens are distributed in languages: a few tokens occur very frequently, some occur with intermediate frequency, and most tokens occur rarely.
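To make the idea of rank concrete, here is a minimal sketch that uses only Python's standard library. The helper name `rank_frequencies` and the sample sentence are illustrative assumptions and are not part of the book's code.

```python
from collections import Counter

def rank_frequencies(tokens):
    """Return (rank, token, frequency) triples, most frequent token first."""
    counts = Counter(tokens)
    # most_common() sorts by descending count, so rank 1 is the most frequent token.
    return [(rank, token, freq)
            for rank, (token, freq) in enumerate(counts.most_common(), start=1)]

# Hypothetical toy sample; a real corpus is needed to see the Zipfian shape.
sample = "the cat and the dog and the bird".split()
for rank, token, freq in rank_frequencies(sample):
    print(rank, token, freq)
```

A sample this small only shows how ranks are assigned. The inverse relationship between rank and frequency becomes visible only on a large corpus.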

Let's look at NLTK code for obtaining a log-log plot of token frequency against rank, as described by Zipf's law:

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.probability import FreqDist
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> matplotlib.use('TkAgg')
>>> fd = FreqDist()
>>> for text in gutenberg.fileids():
...     for word in gutenberg.words(text):
...         fd.inc(word)  # NLTK 2 API; in NLTK 3 use fd[word] += 1
>>> ranks = []
>>> freqs = []
>>> for rank, word ...
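The plot uses log-log axes because a relationship of the form f ∝ 1/r becomes a straight line with a slope near -1 once both axes are logarithmic. As a rough check that needs no plotting, the sketch below fits that slope by ordinary least squares. It is not a continuation of the listing above: `loglog_slope` is a hypothetical helper, and the input is synthetic data constructed to follow the law exactly.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    `freqs` must hold at least two positive counts sorted in descending
    order, so the item at index 0 has rank 1.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic counts that follow f = 1000 / rank exactly (not real corpus data).
ideal = [1000 / rank for rank in range(1, 51)]
print(round(loglog_slope(ideal), 3))
```

On real text, the fitted slope is only approximately -1, and the fit tends to be worst at the very highest and very lowest ranks.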
