As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you’re able to look at more than one token at a time, because so many concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are “open”, “source”, and “government”, could you necessarily say that the text is probably about “open source”, “open government”, both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to try to classify the nature of a document as being about collaborative software development or transformational government, you’d need to go back to the text and somehow determine which of the words most frequently occur after “open”—i.e., you’d like to find the collocations that start with the token “open”.
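To make the idea concrete, here is a minimal sketch of how a machine might count which tokens most frequently follow "open" in a stream of tokens. The token list below is purely hypothetical, standing in for tokenized text from a real document:

```python
from collections import Counter

# Hypothetical token list; in practice these would come from tokenizing a document.
tokens = ["open", "source", "software", "and", "open", "government",
          "data", "from", "an", "open", "source", "project"]

# Pair each token with its successor and count what follows "open".
following = Counter(b for a, b in zip(tokens, tokens[1:]) if a == "open")
print(following.most_common())  # → [('source', 2), ('government', 1)]
```

Counts like these are exactly the raw material a collocation detector works from: "open source" occurring more often than chance would suggest is evidence of a meaningful phrase.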
Recall from Chapter 6 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. A text containing N tokens yields N−n+1 n-grams for any given value of n, so if you were to consider all of the bigrams (2-grams) for the sequence ["Mr.", "Green", "killed", "Colonel", "Mustard"], you’d have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed", "Colonel"), ("Colonel", "Mustard")]. You’d need ...