Cover by Toby Segaran

Clusters of Preferences

One of the best things about the growing interest in social networking sites is that large datasets are becoming available, all contributed voluntarily by the sites' users. One such site is Zebo (http://www.zebo.com), which encourages people to create accounts and list the things they own and the things they would like to own. For an advertiser or a social critic, this information is very interesting because it shows how expressed preferences naturally group together.

Getting and Preparing the Data

This section will go through the process of creating a dataset from the Zebo web site. It involves downloading many pages from the site and parsing them to extract what each user says they want. If you would like to skip this section, you can download a precreated dataset from http://kiwitobes.com/clusters/zebo.txt.
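
As a rough illustration of where the process ends up, the sketch below turns per-user wish lists into a table with one row per user and one 0/1 column per item. The user names, items, and tab-separated layout are hypothetical and chosen only for illustration; they are not taken from Zebo and are not necessarily the layout of the precreated file.

```python
# Hypothetical result of the download-and-parse step:
# each user mapped to the items they say they want.
wants_by_user = {
    "U0": ["bike", "laptop"],
    "U1": ["laptop", "guitar"],
    "U2": ["bike"],
}


def build_matrix(wants_by_user):
    """Return (items, rows): a sorted item list and one 0/1 row per user."""
    items = sorted({item for wants in wants_by_user.values() for item in wants})
    rows = {}
    for user, wants in wants_by_user.items():
        wanted = set(wants)
        # 1 if the user listed the item, 0 otherwise.
        rows[user] = [1 if item in wanted else 0 for item in items]
    return items, rows


def to_tsv(items, rows):
    """Format the matrix as tab-separated text (an assumed layout)."""
    lines = ["User\t" + "\t".join(items)]
    for user in sorted(rows):
        lines.append(user + "\t" + "\t".join(str(v) for v in rows[user]))
    return "\n".join(lines) + "\n"


items, rows = build_matrix(wants_by_user)
print(to_tsv(items, rows))
```

Representing each user as a row of 0s and 1s over a shared item list makes users directly comparable, which is what a later clustering step would need.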

Beautiful Soup

Beautiful Soup is an excellent library for parsing a web page and building a structured representation. It allows you to access any element of the page by type, ID, or any of its properties, and get a string representation of its contents. Beautiful Soup is also very tolerant of web pages with broken HTML, which is useful when generating datasets from web sites.
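
The following is a minimal sketch of that kind of access. It assumes the current `bs4` package (installed as `beautifulsoup4`) rather than the single-file release described in this section, so the import line differs from the older version. The HTML fragment, the `owner` ID, and the `want`/`own` class names are made up for illustration and are not Zebo's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment. The trailing <p>, <body>, and <html>
# tags are deliberately left unclosed to mimic broken HTML.
html = """
<html><body>
<h1 id="owner">alice</h1>
<ul>
  <li class="want">bike</li>
  <li class="want">laptop</li>
  <li class="own">guitar</li>
</ul>
<p>footer text with no closing tags
"""

# "html.parser" is the parser from the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Access by ID.
owner = soup.find(id="owner").get_text()

# Access by element type.
all_items = [li.get_text() for li in soup.find_all("li")]

# Access by another property, here the class attribute.
wants = [li.get_text() for li in soup.find_all("li", attrs={"class": "want"})]

# String representation of an element, including its markup.
first_item_markup = str(soup.find("li"))

print(owner, all_items, wants, first_item_markup)
```

The parse succeeds even though the fragment never closes its last few tags, which is the tolerance to broken HTML mentioned above.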

You can download Beautiful Soup from http://crummy.com/software/BeautifulSoup. It comes as a single Python file, which you can put in your Python library path or in the path where you'll be working and starting the Python interpreter. ...
