Training a chunker can be a great alternative to manually specifying regular expression chunk patterns. Instead of a painstaking process of trial and error to get the patterns exactly right, we can use existing corpus data to train chunkers, much like we did for part-of-speech tagging in the previous chapter.
As with part-of-speech tagging, we'll use the treebank corpus data for training. But this time, we'll use the treebank_chunk corpus, which is specifically formatted to produce chunked sentences in the form of trees. The sentences returned by its chunked_sents() method will be used by a TagChunker class to train a tagger-based chunker. The TagChunker class uses a helper function, conll_tag_chunks(), to extract a list ...
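To make the idea of a chunked sentence "in the form of a tree" concrete, here is a minimal sketch that builds one such tree by hand and flattens it into IOB-tagged tokens using NLTK's tree2conlltags() function. The example sentence and the variable names are made up for illustration, and this is not the conll_tag_chunks() helper itself; it only shows the kind of tree-to-tag conversion that a tagger-based chunker can be trained on.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags

# A hand-built chunk tree (illustrative data, not taken from treebank_chunk).
# Chunks are subtrees labelled NP; tokens outside any chunk are plain
# (word, pos) tuples attached directly to the sentence node S.
chunked_sent = Tree('S', [
    Tree('NP', [('the', 'DT'), ('cat', 'NN')]),
    ('sat', 'VBD'),
    ('on', 'IN'),
    Tree('NP', [('the', 'DT'), ('mat', 'NN')]),
])

# Flatten the tree into (word, pos, iob) triples. The first token of a
# chunk is tagged B-<label>, the following tokens of that chunk are tagged
# I-<label>, and tokens outside any chunk are tagged O.
iob_triples = tree2conlltags(chunked_sent)

for word, pos, iob in iob_triples:
    print(word, pos, iob)
```

With the corpus data installed, the same call should work on any tree returned by treebank_chunk.chunked_sents(), since those trees have the same shallow shape as the one built here.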