Further Reading

Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. For more examples of chunking with NLTK, please see the Chunking HOWTO at http://www.nltk.org/howto.

Chunking owes much of its popularity to pioneering work by Abney, e.g., (Abney, 1996a). Abney’s Cass chunker is described in http://www.vinartus.net/spa/97a.pdf.

The word chink initially meant a sequence of stopwords, according to a 1975 paper by Ross and Tukey (Abney, 1996a).

The IOB format (sometimes called BIO format) was developed for NP chunking by (Ramshaw & Marcus, 1995), and was used for the shared NP bracketing task run by the Conference on Natural Language Learning (CoNLL) in 1999. The same format was adopted by CoNLL 2000 for annotating a section of Wall Street Journal text as part of a shared task on NP chunking.
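As a reminder of what IOB tags encode, here is a minimal sketch in plain Python that groups IOB-tagged tokens back into chunks. The function name `iob_to_chunks` and the sample sentence are illustrative assumptions, not part of NLTK or of the CoNLL tooling; the sketch only assumes tags of the form `B-TYPE`, `I-TYPE`, and `O`.

```python
def iob_to_chunks(tagged):
    """Group (word, iob_tag) pairs into (chunk_type, [words]) tuples.

    Hypothetical helper for illustration: B-X opens a chunk of type X,
    I-X continues it, and O marks a token outside any chunk.
    """
    chunks = []
    current_type, current_words = None, []
    for word, tag in tagged:
        starts_chunk = tag.startswith("B-") or (
            # Treat a stray I-X with no matching open chunk as a chunk start.
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_chunk:
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            current_words.append(word)
        else:
            # An O tag closes any chunk that is currently open.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


# Illustrative sentence with NP chunks only.
sentence = [
    ("We", "B-NP"),
    ("saw", "O"),
    ("the", "B-NP"),
    ("yellow", "I-NP"),
    ("dog", "I-NP"),
]
chunks = iob_to_chunks(sentence)
print(chunks)
```

The B tag is what lets two adjacent chunks of the same type be told apart, which is why the format needs three tags rather than two.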

Section 13.5 of (Jurafsky & Martin, 2008) contains a discussion of chunking. Chapter 22 covers information extraction, including named entity recognition. For information about text mining in biology and medicine, see (Ananiadou & McNaught, 2006).

For more information on the Getty and Alexandria gazetteers, see http://en.wikipedia.org/wiki/Getty_Thesaurus_of_Geographic_Names and http://www.alexandria.ucsb.edu/gazetteer/.
