Now you have a taste of what chunking does, but we haven’t explained how to evaluate chunkers. As usual, this requires a suitably annotated corpus. We begin by looking at the mechanics of converting IOB format into an NLTK tree, then at how this is done on a larger scale using a chunked corpus. We will see how to score the accuracy of a chunker relative to a corpus, then look at some more data-driven ways to search for NP chunks. Our focus throughout will be on expanding the coverage of a chunker.
Using the corpora module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP, and PP. As we have seen, each sentence is represented using multiple lines, as shown here:
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...
A conversion function chunk.conllstr2tree() builds a tree representation from one of these multiline strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just the NP chunks:
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
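To make the conversion mechanics concrete, here is a minimal sketch, in plain Python with no NLTK dependency, of how IOB tags can be grouped into chunks of one type. The helper name iob_to_chunks is our own invention for illustration; it is not part of NLTK, and it returns nested lists rather than a real NLTK tree:

```python
def iob_to_chunks(conll_str, chunk_type='NP'):
    """Group (word, tag, IOB) lines into chunks of the given type.

    Returns a flat list mixing chunk entries like ('NP', [(word, tag), ...])
    with plain (word, tag) pairs for tokens outside the chosen chunk type.
    """
    result, current = [], None
    for line in conll_str.strip().splitlines():
        word, tag, iob = line.split()
        if iob == 'B-' + chunk_type:            # begin a new chunk
            current = (chunk_type, [(word, tag)])
            result.append(current)
        elif iob == 'I-' + chunk_type and current:
            current[1].append((word, tag))      # extend the open chunk
        else:                                    # O tag, or a different chunk type
            current = None
            result.append((word, tag))
    return result

text = """he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP"""
print(iob_to_chunks(text))
# → [('NP', [('he', 'PRP')]), ('accepted', 'VBD'),
#    ('NP', [('the', 'DT'), ('position', 'NN')])]
```

This illustrates the essential rule of the IOB scheme: a B- tag opens a chunk, a following I- tag of the same type extends it, and anything else closes it.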
We can use ...