Parsing builds trees over sentences, according to a phrase structure grammar. Now, all the examples we gave earlier only involved toy grammars containing a handful of productions. What happens if we try to scale up this approach to deal with realistic corpora of language? In this section, we will see how to access treebanks, and look at the challenge of developing broad-coverage grammars.
The corpus module defines the treebank corpus reader, which contains a 10% sample of the Penn Treebank Corpus:
>>> from nltk.corpus import treebank
>>> t = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> print t
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
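Each parse tree also records the phrase structure productions used to build it; as a rough sketch (assuming t is the single tree loaded above), the productions() method of a Tree lists them:

>>> for production in t.productions():   # CFG rules used in this tree
...     print production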
We can use this data to help develop a grammar. For example, the
program in Example 8-18 uses a simple
filter to find verbs that take sentential complements. Assuming we
already have a production of the form
VP -> SV S, this information enables us to identify particular verbs that would be included in the expansion of SV.
Example 8-18. Searching a treebank to find sentential complements.
import nltk

def filter(tree):
    # keep only VP subtrees that have an S (sentential) child
    child_nodes = [child.node for child in tree
                   if isinstance(child, nltk.Tree)]
    return (tree.node == 'VP') and ('S' in child_nodes)
>>> from nltk.corpus import treebank
>>> [subtree for tree in treebank.parsed_sents()
...          for subtree in tree.subtrees(filter)]
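Going one step further than Example 8-18, the matched VP subtrees can be mined for the verbs themselves. The snippet below is only a sketch: the heuristic of keeping any child whose tag begins with 'VB' is an assumption, not part of the original example.

>>> verbs = set(child.leaves()[0]               # the verb's surface form
...             for tree in treebank.parsed_sents()
...             for subtree in tree.subtrees(filter)
...             for child in subtree
...             if isinstance(child, nltk.Tree) and child.node.startswith('VB'))
>>> sorted(verbs)[:10]   # inspect a sample of the verbs found

Verbs collected this way are surface forms; a grammar developer would typically lemmatize them before listing them as expansions of SV.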