Filtering out stopwords, names, and numbers

It's a common requirement in text analysis to get rid of stopwords (common words with low information value). NLTK has stopword corpora for a number of languages. Load the English stopwords corpus and print some of the words:

import nltk
# Requires the stopwords data; fetch it once with nltk.download('stopwords')
sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])

Common words such as the following are printed (the exact selection may vary, because a set is unordered):

Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']

Notice that all the words in this corpus are in lowercase.
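
Because the stopwords are all lowercase, a word should be lowercased before it is compared against the set. The following is a minimal sketch of that idea, not the book's own code: sample_stopwords is a small hand-picked stand-in for the NLTK list (so the example runs without the corpus data), and remove_stopwords is a hypothetical helper name.

```python
# Hypothetical stand-in for the NLTK English stopword set (all lowercase).
# In practice this would be set(nltk.corpus.stopwords.words('english')).
sample_stopwords = {"all", "just", "being", "over", "both", "through"}


def remove_stopwords(words, stopwords):
    """Return the words whose lowercase form is not in the stopword set."""
    return [w for w in words if w.lower() not in stopwords]


words = ["All", "roads", "lead", "Over", "hills"]
filtered = remove_stopwords(words, sample_stopwords)
print(filtered)
```

Without the call to lower(), "All" and "Over" would slip through the filter, since only their lowercase forms are in the set.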

NLTK also has a Gutenberg corpus. Project Gutenberg is a digital library of books, mostly with expired copyright, which are available for free on the Internet (see

Load the Gutenberg corpus and ...

