Cover by Toby Segaran

Safari, the world’s most comprehensive technology and business learning platform.

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required

O'Reilly logo

Word Vectors

The normal way of preparing data for clustering is to determine a common set of numerical attributes that can be used to compare the items. This is very similar to what was shown in Chapter 2, when critics' rankings were compared over a common set of movies, and when the presence or absence of a bookmark was translated to a 1 or a 0 for del.icio.us users.

Pigeonholing the Bloggers

This chapter will work through a couple of example datasets. In the first dataset, the items that will be clustered are a set of 120 of the top blogs, and the data they'll be clustered on is the number of times a particular set of words appears in each blog's feed. A small subset of what this looks like is shown in Table 3-1.

Table 3-1. Subset of blog word frequencies

 

"china"

"kids"

"music"

"yahoo"

Gothamist

0

3

3

0

GigaOM

6

0

0

2

Quick Online Tips

0

2

2

22

By clustering blogs based on word frequencies, it might be possible to determine if there are groups of blogs that frequently write about similar subjects or write in similar styles. Such a result could be very useful in searching, cataloging, and discovering the huge number of blogs that are currently online.

To generate this dataset, you'll be downloading the feeds from a set of blogs, extracting the text from the entries, and creating a table of word frequencies. If you'd like to skip the steps for creating the dataset, you can download it from http://kiwitobes.com/clusters/blogdata.txt.

Counting the Words in a Feed

Almost all blogs can be read online or via ...

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required