We introduced frequency distributions in Computing with Language: Simple Statistics. We saw that
given some list
mylist of words or other items,
FreqDist(mylist) would compute the number of occurrences of each item in the list. Here
we will generalize this idea.
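
As a reminder of what this counting amounts to, here is a minimal sketch that uses Python's standard collections.Counter rather than NLTK itself, so that it runs without any extra installation. The contents of mylist are a made-up toy example, not data from a corpus; the assumption is only that FreqDist(mylist) would report the same per-item counts.

```python
from collections import Counter

# Hypothetical toy list standing in for `mylist`; not taken from a real corpus.
mylist = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

# Counter maps each distinct item to its number of occurrences,
# which is the counting behaviour described for FreqDist(mylist).
counts = Counter(mylist)

print(counts["the"])          # number of times "the" occurs
print(counts["dog"])          # items never seen have a count of zero
print(counts.most_common(1))  # the single most frequent item with its count
```

Looking up an item that never occurred returns zero instead of raising an error, which is convenient when comparing counts across different lists.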
When the texts of a corpus are divided into several categories (by
genre, topic, author, etc.), we can maintain separate frequency
distributions for each category. This will allow us to study systematic
differences between the categories. In the previous section, we achieved
this using NLTK’s ConditionalFreqDist data type. A conditional
frequency distribution is a collection of frequency
distributions, each one for a different “condition.” The condition will
often be the category of the text. Figure 2-4 depicts
a fragment of a conditional frequency distribution having just two
conditions, one for news text and one for romance text.
Figure 2-4. Counting words appearing in a text collection (a conditional frequency distribution).
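
The idea of "one frequency distribution per condition" can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not NLTK's implementation: it keeps a Counter for each condition inside a defaultdict and fills it from (condition, word) pairs. The two conditions mirror the news and romance categories in the figure, but the pairs themselves are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, word) pairs; the words and counts are made up.
pairs = [
    ("news", "the"), ("news", "could"), ("news", "the"),
    ("romance", "the"), ("romance", "could"), ("romance", "could"),
    ("romance", "love"),
]

# One frequency distribution (a Counter) per condition.
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(sorted(cfd))              # the conditions that were seen
print(cfd["news"]["the"])       # count of "the" under the news condition
print(cfd["romance"]["could"])  # count of "could" under the romance condition
```

Because each condition has its own counter, the same word can be compared across categories simply by indexing first by condition and then by word.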