Further Reading
Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. For more examples of chunking with NLTK, please see the Chunking HOWTO at http://www.nltk.org/howto.
The popularity of chunking is due in great part to pioneering work by Abney, e.g., (Abney, 1996a). Abney’s Cass chunker is described at http://www.vinartus.net/spa/97a.pdf.
The word chink initially meant a sequence of stopwords, according to a 1975 paper by Ross and Tukey (Abney, 1996a).
The IOB format (or sometimes BIO format) was developed for NP chunking by (Ramshaw & Marcus, 1995), and was used for the shared NP bracketing task run by the Conference on Natural Language Learning (CoNLL) in 1999. The same format was adopted by CoNLL 2000 for annotating a section of Wall Street Journal text as part of a shared task on NP chunking.
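As a reminder of what IOB tagging looks like in practice, the following is a minimal sketch in plain Python rather than NLTK's own implementation. The function name `iob_to_chunks` and the sample sentence are illustrative assumptions. Each token carries a tag: `B-NP` begins a chunk, `I-NP` continues it, and `O` marks a token outside any chunk.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs.

    Hypothetical helper for illustration; not part of NLTK.
    """
    chunks = []
    current_type, current_words = None, []
    for word, _pos, iob in tagged:
        if iob == "O":
            starts_new, chunk_type = False, None
        else:
            prefix, chunk_type = iob.split("-", 1)
            # A B- tag, or a change of chunk type, opens a new chunk.
            starts_new = prefix == "B" or chunk_type != current_type
        # Close the open chunk when we leave it or when a new one begins.
        if current_words and (iob == "O" or starts_new):
            chunks.append((current_type, current_words))
            current_type, current_words = None, []
        if iob != "O":
            current_type = chunk_type
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


# Illustrative sentence, tagged by hand for this sketch.
sentence = [
    ("We", "PRP", "B-NP"),
    ("saw", "VBD", "O"),
    ("the", "DT", "B-NP"),
    ("yellow", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
]

print(iob_to_chunks(sentence))
```

The sketch treats an `I-` tag that follows `O` as the start of a new chunk, which is a lenient design choice; a stricter reader might reject such input instead.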
Section 13.5 of (Jurafsky & Martin, 2008) contains a discussion of chunking. Chapter 22 covers information extraction, including named entity recognition. For information about text mining in biology and medicine, see (Ananiadou & McNaught, 2006).
For more information on the Getty and Alexandria gazetteers, see http://en.wikipedia.org/wiki/Getty_Thesaurus_of_Geographic_Names and http://www.alexandria.ucsb.edu/gazetteer/.