Analyzing word frequencies

The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out stopwords and punctuation:

punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]

Create a FreqDist object and print associated keys and values with highest frequency:

fd = nltk.FreqDist(filtered)
print "Words", fd.keys()[:5]
print "Counts", fd.values()[:5]

The keys and values are printed as follows:

Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]

The first word in this list is of course not an English word, so we may need to add the heuristic ...

Get Python Data Analysis now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.