Applying Zipf's law to text

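Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the list of tokens sorted by decreasing frequency. The most frequent token therefore occurs roughly twice as often as the second most frequent, three times as often as the third, and so on. This law describes how tokens are distributed in languages: a few tokens occur very frequently, some occur with intermediate frequency, and many tokens rarely occur.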
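To make the idea of rank concrete, here is a minimal sketch that builds a rank-frequency table using only the Python standard library. The function name rank_frequency and the toy sentence are assumptions made for illustration; they do not come from the book or from NLTK.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent token first."""
    counts = Counter(tokens)
    # most_common() sorts by descending count, so rank 1 is the most frequent token.
    return [
        (rank, token, count)
        for rank, (token, count) in enumerate(counts.most_common(), start=1)
    ]


# Toy sample, chosen only for illustration (hypothetical data).
sample = "the cat and the dog and the bird saw the cat".split()
table = rank_frequency(sample)

for rank, token, count in table:
    # Under Zipf's law, rank * count stays roughly constant on large corpora.
    print(rank, token, count, rank * count)
```

A sample this small will not follow the law closely; the pattern only becomes visible on a large corpus.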

Let's see the NLTK code for obtaining a log-log plot of token frequency against rank, which is the usual way to check Zipf's law:

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.probability import FreqDist
>>> import matplotlib
>>> matplotlib.use('TkAgg')  # select the backend before importing pyplot
>>> import matplotlib.pyplot as plt
>>> fd = FreqDist()
>>> for text in gutenberg.fileids():
...     for word in gutenberg.words(text):
...         fd[word] += 1  # NLTK 3 removed FreqDist.inc(); increment the count directly
>>> ranks = []
>>> freqs = []
>>> for rank, word ...
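As a separate, self-contained sketch (not the book's listing), the code below shows one way to turn token counts into log-log coordinates and to estimate the slope of the resulting line with ordinary least squares. If frequency is inversely proportional to rank, the slope should be close to -1. The function names log_log_points, fit_slope, and plot_log_log are hypothetical, and the synthetic tokens are constructed so that the frequency is exactly 60 divided by the rank.

```python
import math
from collections import Counter


def log_log_points(tokens):
    """Return (log(rank), log(frequency)) pairs, most frequent token first."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return [(math.log(rank), math.log(freq))
            for rank, freq in enumerate(counts, start=1)]


def fit_slope(points):
    """Least-squares slope of y against x; needs at least two distinct x values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def plot_log_log(points):
    """Plot the points; matplotlib is imported lazily so the rest runs without it."""
    import matplotlib.pyplot as plt
    xs, ys = zip(*points)
    plt.plot(xs, ys, marker="o")
    plt.xlabel("log(rank)")
    plt.ylabel("log(frequency)")
    plt.show()


# Synthetic tokens (hypothetical data): the token of rank r occurs 60 // r times.
tokens = [f"w{r}" for r in range(1, 7) for _ in range(60 // r)]
points = log_log_points(tokens)
slope = fit_slope(points)
print(slope)
```

With NLTK installed and the Gutenberg corpus downloaded, the same functions could be applied to gutenberg.words(); real text will generally give a slope near, but not exactly, -1.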

Natural Language Processing: Python and NLTK by Nitin Hardeniya, Jacob Perkins, Deepti Chopra, Nisheeth Joshi, Iti Mathur
