Filtering out stopwords, names, and numbers

It's a common requirement in text analysis to get rid of stopwords (common words with low information value). NLTK has a stopwords corpora for a number of languages. Load the English stopwords corpus and print some of the words:

sw = set(nltk.corpus.stopwords.words('english'))
print "Stop words", list(sw)[:7]

The following common words are printed:

Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']

Notice that all the words in this corpus are in lowercase.

NLTK also has a Gutenberg corpus. The Gutenberg project is a digital library of books mostly with expired copyright, which are available for free on the Internet (see http://www.gutenberg.org/).

Load the Gutenberg corpus and ...

Get Python Data Analysis now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.