Every algorithm in the book is motivated by a realistic problem that can, I hope, be easily understood by all readers. I have tried to avoid problems that require a great deal of domain knowledge, and I have focused on problems that, while complex, are easy for most people to relate to.
Explains the concepts behind machine learning, how it is applied in many different fields, and how it can be used to draw new conclusions from data gathered from many different people.
Introduces the collaborative filtering techniques used by many online retailers to recommend products or media. The chapter includes a section on recommending links to people from a social bookmarking site, and building a movie recommendation system from the MovieLens dataset.
Builds on some of the ideas in Chapter 2 and introduces two different methods of clustering, which automatically detect groups of similar items in a large dataset. This chapter demonstrates the use of clustering to find groups on a set of popular weblogs and on people's desires from a social networking web site.
Describes the various parts of a search engine including the crawler, indexer, and query engine. It covers the PageRank algorithm for scoring pages based on inbound links and shows you how to create a neural network that learns which keywords are associated with different results.
Introduces algorithms for optimization, which are designed to search millions of possible solutions to a problem and choose the best one. The wide variety of uses for these algorithms is demonstrated with examples that find the best flights for a group of people traveling to the same location, find the best way of matching students to dorms, and lay out a network with the minimum number of crossed lines.
Demonstrates Bayesian filtering, which is used in many free and commercial spam filters for automatically classifying documents based on the type of words and other features that appear in the document. This is applied to a set of RSS search results to demonstrate automatic classification of the entries.
Introduces decision trees as a method not only of making predictions, but also of modeling the way the decisions are made. The first decision tree is built with hypothetical data from server logs and is used to predict whether or not a user is likely to become a premium subscriber. The other examples use data from real web sites to model real estate prices and "hotness."
Approaches the problem of predicting numerical values rather than classifications using k-nearest neighbors techniques, and applies the optimization algorithms from Chapter 5. These methods are used in conjunction with the eBay API to build a system for predicting eventual auction prices for items based on a set of properties.
Shows how support-vector machines can be used to match people in online dating sites or when searching for professional contacts. Support-vector machines are a fairly advanced technique and this chapter compares them to other methods.
Introduces a relatively new technique called non-negative matrix factorization, which is used to find the independent features in a dataset. In many datasets the items are constructed as a composite of different features that we don't know in advance; the idea here is to detect these features. This technique is demonstrated on a set of news articles, where the stories themselves are used to detect themes, one or more of which may apply to a given story.
Introduces genetic programming, a very sophisticated set of techniques that goes beyond optimization and actually builds algorithms using evolutionary ideas to solve a particular problem. This is demonstrated by a simple game in which the computer is initially a poor player that improves its skill by improving its own code the more the game is played.
Reviews all the machine-learning and statistical algorithms described in the book and compares them to a set of artificial problems. This will help you understand how they work and visualize the way that each of them divides data.
Gives information on third-party libraries used in the book, such as where to find them and how to install them.
Contains formulae, descriptions, and code for many of the mathematical concepts introduced throughout the book.
Exercises at the end of each chapter give ideas of ways to extend the algorithms and make them more powerful.