O'Reilly logo

Mining the Social Web by Matthew A. Russell

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Bigram Analysis

As previously mentioned, one issue that is frequently overlooked in unstructured text processing is the tremendous amount of information gained when you’re able to look at more than one token at a time, because so many concepts we express are phrases and not just single words. For example, if someone were to tell you that a few of the most common terms in a post are “open”, “source”, and “government”, could you necessarily say that the text is probably about “open source”, “open government”, both, or neither? If you had a priori knowledge of the author or content, you could probably make a good guess, but if you were relying totally on a machine to try to classify the nature of a document as being about collaborative software development or transformational government, you’d need to go back to the text and somehow determine which of the words most frequently occur after “open”—i.e., you’d like to find the collocations that start with the token “open”.

Recall from Chapter 6 that an n-gram is just a terse way of expressing each possible consecutive sequence of n tokens from a text, and it provides the foundational data structure for computing collocations. There are always (n-1) n-grams for any value of n, and if you were to consider all of the bigrams (2-grams) for the sequence of tokens ["Mr.", "Green", "killed", "Colonel", "Mustard"], you’d have four possibilities: [("Mr.", "Green"), ("Green", "killed"), ("killed", "Colonel"), ("Colonel", "Mustard")]. You’d need ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required