Cover by Matthew A. Russell

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

O'Reilly logo

Sentence Detection in Blogs with NLTK

Given that sentence detection is probably the first task you’ll want to ponder when building an NLP stack, it makes sense to start there. Even if you never complete the remaining tasks in the pipeline, it turns out that EOS detection alone yields some powerful possibilities such as document summarization, which we’ll be considering as a follow-up exercise. But first, we’ll need to fetch some high-quality blog data. Let’s use the tried and true feedparser module, which you can easy_install if you don’t have it already, to fetch some posts from the O’Reilly Radar blog. The listing in Example 8-1 fetches a few posts and saves them to a local file as plain old JSON, since nothing else in this chapter hinges on the capabilities of a more advanced storage medium, such as CouchDB. As always, you can choose to store the posts anywhere you’d like.

Example 8-1. Harvesting blog data by parsing feeds (blogs_and_nlp__get_feed.py)

# -*- coding: utf-8 -*- import os import sys from datetime import datetime as dt import json import feedparser from BeautifulSoup import BeautifulStoneSoup from nltk import clean_html # Example feed: # http://feeds.feedburner.com/oreilly/radar/atom FEED_URL = sys.argv[1] def cleanHtml(html): return BeautifulStoneSoup(clean_html(html), convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0] fp = feedparser.parse(FEED_URL) print "Fetched %s entries from '%s'" % (len(fp.entries[0].title), fp.feed.title) blog_posts = [] for e in ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required