Netflix is an online DVD rental company that lets people choose movies to be sent to their homes, and makes recommendations based on the movies that customers have previously rented. In late 2006 it announced a prize of $1 million to the first person to improve the accuracy of its recommendation system by 10 percent, along with progress prizes of $50,000 to the current leader each year for as long as the contest runs. Thousands of teams from all over the world entered and, as of April 2007, the leading team has managed to score an improvement of 7 percent. By using data about which movies each customer enjoyed, Netflix is able to recommend movies that other customers may never have even heard of, which keeps them coming back for more. Any way to improve its recommendation system is worth a lot of money to Netflix.
The search engine Google was started in 1998, at a time when there were already several big search engines, and many assumed that a new player would never be able to take on the giants. The founders of Google, however, took a completely new approach to ranking search results by using the links on millions of web sites to decide which pages were most relevant. Google's search results were so much better than those of the other players that by 2004 it handled 85 percent of searches on the Web. Its founders are now among the top 10 richest people in the world.
What do these two companies have in common? They both drew new conclusions and created new business opportunities by using sophisticated algorithms to combine data collected from many different people. The ability to collect information and the computational power to interpret it have enabled great collaboration opportunities and a better understanding of users and customers. This sort of work is happening all over the place: dating sites want to help people find their best match more quickly, companies that predict changes in airplane ticket prices are cropping up, and just about everyone wants to understand their customers better in order to create more targeted advertising.
These are just a few examples in the exciting field of collective intelligence, and the proliferation of new services means there are new opportunities appearing every day. I believe that understanding machine learning and statistical methods will become ever more important in a wide variety of fields, but particularly in interpreting and organizing the vast amount of information that is being created by people all over the world.
People have used the phrase collective intelligence for decades, and it has become increasingly popular and more important with the advent of new communications technologies. Although the expression may bring to mind ideas of group consciousness or supernatural phenomena, when technologists use this phrase they usually mean the combining of behavior, preferences, or ideas of a group of people to create novel insights.
Collective intelligence was, of course, possible before the Internet. You don't need the Web to collect data from disparate groups of people, combine it, and analyze it. One of the most basic forms of this is a survey or census. Collecting answers from a large group of people lets you draw statistical conclusions about the group that no individual member would have known by themselves. Building new conclusions from independent contributors is really what collective intelligence is all about.
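The survey idea can be illustrated with a minimal sketch. The responses, field names, and the summarize function below are invented purely for illustration; the point is only that the average and the shares are facts about the group that no single respondent could have supplied.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses (made-up data): each respondent
# knows only his or her own answers.
responses = [
    {"age": 34, "commute": "car"},
    {"age": 27, "commute": "bus"},
    {"age": 45, "commute": "car"},
    {"age": 52, "commute": "bike"},
    {"age": 22, "commute": "car"},
]

def summarize(responses):
    """Combine individual answers into group-level statistics."""
    average_age = mean(r["age"] for r in responses)
    commute_counts = Counter(r["commute"] for r in responses)
    # Convert raw counts into the fraction of respondents per answer.
    shares = {mode: count / len(responses)
              for mode, count in commute_counts.items()}
    return average_age, shares

average_age, commute_shares = summarize(responses)
print(average_age, commute_shares)
```

With real survey data the same structure applies; only the questions and the statistics computed would change.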
A well-known example is financial markets, where a price is not set by one individual or by a coordinated effort, but by the trading behavior of many independent people all acting in what they believe is their own best interest. Although it seems counterintuitive at first, futures markets, in which many participants trade contracts based on their beliefs about future prices, are considered to be better at predicting prices than experts who independently make projections. This is because these markets combine the knowledge, experience, and insight of thousands of people to create a projection rather than relying on a single person's perspective.
Although methods for collective intelligence existed before the Internet, the ability to collect information from thousands or even millions of people on the Web has opened up many new possibilities. At all times, people are using the Internet for making purchases, doing research, seeking out entertainment, and building their own web sites. All of this behavior can be monitored and used to derive information without ever having to interrupt users by asking them questions. There are a huge number of ways this information can be processed and interpreted. Here are a couple of key examples that show the contrasting approaches:
Wikipedia is an online encyclopedia created entirely from user contributions. Any page can be created or edited by anyone, and there are a small number of administrators who monitor repeated abuses. Wikipedia has more entries than any other encyclopedia, and despite some manipulation by malicious users, it is generally believed to be accurate on most subjects. This is an example of collective intelligence because each article is maintained by a large group of people and the result is an encyclopedia far larger than any single coordinated group has been able to create. The Wikipedia software does not do anything particularly intelligent with user contributions—it simply tracks the changes and displays the latest version.
Google, mentioned earlier, is the world's most popular Internet search engine, and was the first search engine to rate web pages based on how many other pages link to them. This method of rating takes information about what thousands of people have said about a particular web page and uses that information to rank the results in a search. This is a very different example of collective intelligence. Where Wikipedia explicitly invites users of the site to contribute, Google extracts the important information from what web-content creators do on their own sites and uses it to generate scores for its users.
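The idea of rating a page by how many other pages link to it can be sketched in a few lines. This is a deliberately simplified illustration that only counts inbound links; it is not Google's actual algorithm, which is considerably more involved. The link graph, the page names, and both function names are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical link graph (made-up data): each page maps to the
# pages it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "b.example"],
}

def inbound_link_counts(links):
    """Count how many distinct other pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # count each linking page once
            if target != source:      # ignore links from a page to itself
                counts[target] += 1
    return counts

def rank_pages(links):
    """Order pages by inbound-link count, breaking ties by name."""
    counts = inbound_link_counts(links)
    return sorted(counts, key=lambda page: (-counts[page], page))

ranking = rank_pages(links)
print(ranking)
```

No page author was asked to rate anything here: the ranking falls out of what the authors chose to link to on their own sites.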
While Wikipedia is a great resource and an impressive example of collective intelligence, it owes its existence much more to the user base that contributes information than it does to clever algorithms in the software. This book focuses on the other end of the spectrum, covering algorithms like Google's PageRank, which take user data and perform calculations to create new information that can enhance the user experience. Some data is collected explicitly, perhaps by asking people to rate things, and some is collected implicitly, for example by watching what people buy. In both cases, the important thing is not just to collect and display the information, but to process it in an intelligent way and generate new information.
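As a rough sketch of generating new information from data gathered by watching what people buy, the example below counts which items appear together in the same customer's purchases. The customers, the items, and the function names are all hypothetical, and simple co-occurrence counting stands in for the more capable methods a real system would use.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical purchase histories (made-up data), collected without
# asking customers anything.
purchases = {
    "alice": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove"},
    "carol": {"tent", "lantern"},
    "dave": {"stove", "kettle"},
}

def co_purchase_counts(purchases):
    """For each item, count how many customers also bought each other item."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        for first, second in permutations(items, 2):
            counts[first][second] += 1
    return counts

def also_bought(item, purchases, limit=2):
    """Return the items most often bought alongside the given item."""
    related = co_purchase_counts(purchases)[item]
    # Most frequent first; ties are broken alphabetically.
    return sorted(related, key=lambda other: (-related[other], other))[:limit]

print(also_bought("stove", purchases))
```

No single purchase history contains the statement that stove buyers tend to buy tents; that statement only exists once the histories are combined.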
This book will show you ways to collect data through open APIs, and it will cover a variety of machine-learning algorithms and statistical methods. This combination will allow you to set up collective intelligence methods on data collected from your own applications, and also to collect and experiment with data from other places.