By convention in NLTK, a tagged token is represented using a
tuple consisting of the token and the tag. We can create one of these
special tuples from the standard string representation of a tagged
token, using the function
>>> tagged_token = nltk.tag.str2tuple('fly/NN') >>> tagged_token ('fly', 'NN') >>> tagged_token 'fly' >>> tagged_token 'NN'
We can construct a list of tagged tokens directly from a string.
The first step is to tokenize the string to access the individual
word/tag strings, and then to convert each of these into a tuple
>>> sent = ''' ... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN ... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC ... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS ... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB ... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT ... interest/NN of/IN both/ABX governments/NNS ''/'' ./. ... ''' >>> [nltk.tag.str2tuple(t) for t in sent.split()] [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
Several of the corpora included with NLTK have been tagged for their part-of-speech. Here’s an example of what you might see if you opened a file from the Brown Corpus with a text editor:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl ...