Given that sentence detection is probably the first task you’ll want
to ponder when building an NLP stack, it makes sense to start there. Even
if you never complete the remaining tasks in the pipeline, it turns out
that EOS detection alone yields some powerful possibilities such as
document summarization, which we’ll be considering as a follow-up
exercise. But first, we’ll need to fetch some high-quality blog data.
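Before we fetch anything, it may help to see what EOS detection actually involves. The sketch below is deliberately naive and entirely illustrative (the function name and sample strings are mine, not from any library): it splits wherever a sentence-ending punctuation mark is followed by whitespace and a capital letter. A real tokenizer, such as the one NLTK provides, handles abbreviations, quotations, and ellipses far more robustly, which is exactly why EOS detection is a genuine NLP task rather than a one-line regex.

```python
import re

# Naive EOS detection: split wherever '.', '!', or '?' is followed by
# whitespace and a capital letter. This is only a sketch -- a proper
# tokenizer handles abbreviations, quotes, and ellipses correctly.
def naive_sent_detect(text):
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

sents = naive_sent_detect("The cat sat. The dog barked! Where did they go?")
# -> ['The cat sat.', 'The dog barked!', 'Where did they go?']

# The naive rule stumbles on abbreviations: "Mr." triggers a bogus split.
bad = naive_sent_detect("Mr. Green lost.")
# -> ['Mr.', 'Green lost.']
```

The abbreviation failure above is the classic reason trained sentence tokenizers exist at all.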
Let’s use the tried and true feedparser module, which you can easy_install if you don’t have it
already, to fetch some posts from the O’Reilly Radar blog. The listing in
Example 8-1 fetches a few posts and saves
them to a local file as plain old JSON, since nothing else in this chapter
hinges on the capabilities of a more advanced storage medium, such as
CouchDB. As always, you can choose to store the posts anywhere you’d like.
Example 8-1. Harvesting blog data by parsing feeds (blogs_and_nlp__get_feed.py)
# -*- coding: utf-8 -*-

import os
import sys
from datetime import datetime as dt
import json
import feedparser
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

# Example feed:
# http://feeds.feedburner.com/oreilly/radar/atom
FEED_URL = sys.argv[1]

def cleanHtml(html):
    # Strip markup with nltk's clean_html, then resolve HTML entities
    return BeautifulStoneSoup(clean_html(html),
            convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

fp = feedparser.parse(FEED_URL)

print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)

blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': cleanHtml(e.content[0].value),
                       'link': e.links[0].href})

# Save the posts as plain old JSON in the current working directory
f = open('feed.json', 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()