Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information, such as part-of-speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, if we have defined a text my_text, then vocab = sorted(set(my_text)) builds the vocabulary of my_text, whereas word_freq = FreqDist(my_text) counts the frequency of each word in the text. Both vocab and word_freq are simple lexical resources. Similarly, a concordance like the one we saw in Computing with Language: Texts and Words gives us information about word usage that might help in the preparation of a dictionary. Standard terminology for lexicons is illustrated in Figure 2-5. A lexical entry consists of a headword (also known as a lemma) along with additional information, such as the part-of-speech and the sense definition. Two distinct words having the same spelling are called homonyms.
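The vocabulary and frequency computations described above can be sketched as follows. This is a minimal illustration, not the book's own listing: the token list my_text is a made-up sample, and collections.Counter stands in for NLTK's FreqDist (which is a Counter subclass) so that the snippet runs without NLTK installed.

```python
from collections import Counter

# Hypothetical sample text: a list of word tokens (an assumption for illustration).
my_text = ["the", "cat", "saw", "the", "dog", "and",
           "the", "dog", "saw", "the", "saw"]

# The vocabulary: each distinct word once, in sorted order.
vocab = sorted(set(my_text))

# Word frequencies; Counter is used here as a stand-in for FreqDist.
word_freq = Counter(my_text)

print(vocab)                     # ['and', 'cat', 'dog', 'saw', 'the']
print(word_freq.most_common(2))  # [('the', 4), ('saw', 3)]
```

With NLTK available, replacing Counter(my_text) with nltk.FreqDist(my_text) should give the same counts for this input.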

Figure 2-5. Lexicon terminology: Lexical entries for two lemmas having the same spelling (homonyms), providing part-of-speech and gloss information.

The simplest kind of lexicon is nothing more than a sorted list of words. Sophisticated lexicons include complex structure within and across the individual entries. In this section, we’ll look at some lexical resources included with NLTK.
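To make the contrast between a plain word list and a structured lexicon concrete, here is one possible way to represent lexical entries in Python. Everything in it is an assumption made for illustration: the LexicalEntry type, its field names, and the sample entries are hypothetical and are not part of NLTK.

```python
from collections import defaultdict
from typing import NamedTuple

class LexicalEntry(NamedTuple):
    """A hypothetical lexical entry: headword plus associated information."""
    headword: str  # the lemma
    pos: str       # part-of-speech
    gloss: str     # sense definition

# Illustrative entries (made up for this sketch).
entries = [
    LexicalEntry("saw", "verb", "past tense of see"),
    LexicalEntry("saw", "noun", "cutting instrument"),
    LexicalEntry("dog", "noun", "domesticated canine"),
]

# Group entries by spelling so that each headword maps to all of its entries.
lexicon = defaultdict(list)
for entry in entries:
    lexicon[entry.headword].append(entry)

# The simplest lexicon: just the sorted list of headwords.
wordlist = sorted(lexicon)

# Spellings shared by more than one entry are homonyms.
homonyms = [word for word in wordlist if len(lexicon[word]) > 1]

print(wordlist)  # ['dog', 'saw']
print(homonyms)  # ['saw']
```

Keying the dictionary on the spelling keeps homonyms together while preserving each entry's own part-of-speech and gloss.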

Wordlist Corpora

NLTK includes ...
