Algorithms of the Intelligent Web

Book Description

Web 2.0 applications are best known for providing a rich user experience, but the parts you can't see are just as important, and just as impressive. Many Web 2.0 applications use powerful techniques to process information intelligently and offer features based on patterns and relationships in the data that couldn't be discovered manually. Successful examples of these intelligent-web algorithms at work include household names such as Google AdSense, Netflix, and Amazon. These applications use the internet as a platform that not only gathers data at an ever-increasing pace but also systematically transforms that raw data into actionable information.

Algorithms of the Intelligent Web is an example-driven blueprint for creating applications that collect, analyze, and act on the massive quantities of data users leave in their wake as they use the web. You'll learn how to build Amazon- and Netflix-style recommendation engines, and how the same techniques apply to matching people on social-networking sites. You'll see how click-trace analysis can result in smarter ad rotations. With a plethora of examples and extensive detail, this book shows you how to build Web 2.0 applications that are as smart as your users.
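
To give a concrete flavor of these techniques, here is a minimal sketch, in the spirit of the user-based recommendations covered in chapter 3, of the similarity calculation that recommendation engines build on. It is not code from the book: the class name, users, and rating values are hypothetical illustration data, and cosine similarity is just one of the candidate formulas the book compares (section 3.1.3).

    // A minimal sketch of user-based similarity, the building block of
    // "recommendations based on similar users" (section 3.2.1). The
    // users and rating values below are hypothetical illustration data.
    public class SimilaritySketch {

        // Cosine similarity between two rating vectors: values near 1.0
        // mean the two users rate items in a very similar way.
        static double cosineSimilarity(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            // Ratings (1-5) that two users gave the same five songs;
            // 0 means the user hasn't rated that song.
            double[] alice = {5, 4, 0, 3, 1};
            double[] bob   = {4, 5, 1, 3, 0};

            // Prints "similarity = 0.961" for this data.
            System.out.printf("similarity = %.3f%n",
                    cosineSimilarity(alice, bob));
        }
    }

Given a score like this, a user-based recommender finds a user's most similar peers and suggests the items those peers rated highly that the user hasn't yet seen; the book's own implementations add data normalization and the other refinements listed in the chapters below.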

Table of Contents

  1. Copyright
  2. Preface
  3. Acknowledgments
    1. H. Marmanis
    2. D. Babenko
  4. About this book
    1. Roadmap
    2. Who should read this book
    3. Code Conventions
    4. Author Online
    5. About the cover illustration
  5. 1. What is the intelligent web?
    1. 1.1. Examples of intelligent web applications
    2. 1.2. Basic elements of intelligent applications
    3. 1.3. What applications can benefit from intelligence?
      1. 1.3.1. Social networking sites
      2. 1.3.2. Mashups
      3. 1.3.3. Portals
      4. 1.3.4. Wikis
      5. 1.3.5. Media-sharing sites
      6. 1.3.6. Online gaming
    4. 1.4. How can I build intelligence in my own application?
      1. 1.4.1. Examine your functionality and your data
      2. 1.4.2. Get more data from the web
    5. 1.5. Machine learning, data mining, and all that
    6. 1.6. Eight fallacies of intelligent applications
      1. 1.6.1. Fallacy #1: Your data is reliable
      2. 1.6.2. Fallacy #2: Inference happens instantaneously
      3. 1.6.3. Fallacy #3: The size of data doesn't matter
      4. 1.6.4. Fallacy #4: Scalability of the solution isn't an issue
      5. 1.6.5. Fallacy #5: Apply the same good library everywhere
      6. 1.6.6. Fallacy #6: The computation time is known
      7. 1.6.7. Fallacy #7: Complicated models are better
      8. 1.6.8. Fallacy #8: There are models without bias
    7. 1.7. Summary
    8. 1.8. References
  6. 2. Searching
    1. 2.1. Searching with Lucene
      1. 2.1.1. Understanding the Lucene code
      2. 2.1.2. Understanding the basic stages of search
    2. 2.2. Why search beyond indexing?
    3. 2.3. Improving search results based on link analysis
      1. 2.3.1. An introduction to PageRank
      2. 2.3.2. Calculating the PageRank vector
      3. 2.3.3. alpha: The effect of teleportation between web pages
      4. 2.3.4. Understanding the power method
      5. 2.3.5. Combining the index scores and the PageRank scores
    4. 2.4. Improving search results based on user clicks
      1. 2.4.1. A first look at user clicks
      2. 2.4.2. Using the NaiveBayes classifier
      3. 2.4.3. Combining Lucene indexing, PageRank, and user clicks
    5. 2.5. Ranking Word, PDF, and other documents without links
      1. 2.5.1. An introduction to DocRank
      2. 2.5.2. The inner workings of DocRank
    6. 2.6. Large-scale implementation issues
    7. 2.7. Is what you got what you want? Precision and recall
    8. 2.8. Summary
    9. 2.9. To do
    10. 2.10. References
  7. 3. Creating suggestions and recommendations
    1. 3.1. An online music store: the basic concepts
      1. 3.1.1. The concepts of distance and similarity
      2. 3.1.2. A closer look at the calculation of similarity
      3. 3.1.3. Which is the best similarity formula?
    2. 3.2. How do recommendation engines work?
      1. 3.2.1. Recommendations based on similar users
      2. 3.2.2. Recommendations based on similar items
      3. 3.2.3. Recommendations based on content
    3. 3.3. Recommending friends, articles, and news stories
      1. 3.3.1. Introducing MyDiggSpace.com
      2. 3.3.2. Finding friends
      3. 3.3.3. The inner workings of DiggDelphi
    4. 3.4. Recommending movies on a site such as Netflix.com
      1. 3.4.1. An introduction to movie datasets and recommenders
      2. 3.4.2. Data normalization and correlation coefficients
    5. 3.5. Large-scale implementation and evaluation issues
    6. 3.6. Summary
    7. 3.7. To do
    8. 3.8. References
  8. 4. Clustering: grouping things together
    1. 4.1. The need for clustering
      1. 4.1.1. User groups on a website: a case study
      2. 4.1.2. Finding groups with a SQL ORDER BY clause
      3. 4.1.3. Finding groups with array sorting
    2. 4.2. An overview of clustering algorithms
      1. 4.2.1. Clustering algorithms based on cluster structure
      2. 4.2.2. Clustering algorithms based on data type and structure
      3. 4.2.3. Clustering algorithms based on data size
    3. 4.3. Link-based algorithms
      1. 4.3.1. The dendrogram: a basic clustering data structure
      2. 4.3.2. A first look at link-based algorithms
      3. 4.3.3. The single-link algorithm
      4. 4.3.4. The average-link algorithm
      5. 4.3.5. The minimum-spanning-tree algorithm
    4. 4.4. The k-means algorithm
      1. 4.4.1. A first look at the k-means algorithm
      2. 4.4.2. The inner workings of k-means
    5. 4.5. Robust Clustering Using Links (ROCK)
      1. 4.5.1. Introducing ROCK
      2. 4.5.2. Why does ROCK rock?
    6. 4.6. DBSCAN
      1. 4.6.1. A first look at density-based algorithms
      2. 4.6.2. The inner workings of DBSCAN
    7. 4.7. Clustering issues in very large datasets
      1. 4.7.1. Computational complexity
      2. 4.7.2. High dimensionality
    8. 4.8. Summary
    9. 4.9. To do
    10. 4.10. References
  9. 5. Classification: placing things where they belong
    1. 5.1. The need for classification
    2. 5.2. An overview of classifiers
      1. 5.2.1. Structural classification algorithms
      2. 5.2.2. Statistical classification algorithms
      3. 5.2.3. The lifecycle of a classifier
    3. 5.3. Automatic categorization of emails and spam filtering
      1. 5.3.1. NaïveBayes classification
      2. 5.3.2. Rule-based classification
    4. 5.4. Fraud detection with neural networks
      1. 5.4.1. A use case of fraud detection in transactional data
      2. 5.4.2. Neural networks overview
      3. 5.4.3. A neural network fraud detector at work
      4. 5.4.4. The anatomy of the fraud detector neural network
      5. 5.4.5. A base class for building general neural networks
    5. 5.5. Are your results credible?
    6. 5.6. Classification with very large datasets
    7. 5.7. Summary
    8. 5.8. To do
    9. 5.9. References
      1. Classification schemes
      2. Books and articles
  10. 6. Combining classifiers
    1. 6.1. Credit worthiness: a case study for combining classifiers
      1. 6.1.1. A brief description of the data
      2. 6.1.2. Generating artificial data for real problems
    2. 6.2. Credit evaluation with a single classifier
      1. 6.2.1. The naïve Bayes baseline
      2. 6.2.2. The decision tree baseline
      3. 6.2.3. The neural network baseline
    3. 6.3. Comparing multiple classifiers on the same data
      1. 6.3.1. McNemar's test
      2. 6.3.2. The difference of proportions test
      3. 6.3.3. Cochran's Q test and the F test
    4. 6.4. Bagging: bootstrap aggregating
      1. 6.4.1. The bagging classifier at work
      2. 6.4.2. A look under the hood of the bagging classifier
      3. 6.4.3. Classifier ensembles
    5. 6.5. Boosting: an iterative improvement approach
      1. 6.5.1. The boosting classifier at work
      2. 6.5.2. A look under the hood of the boosting classifier
    6. 6.6. Summary
    7. 6.7. To do
    8. 6.8. References
  11. 7. Putting it all together: an intelligent news portal
    1. 7.1. An overview of the functionality
    2. 7.2. Getting and cleansing content
      1. 7.2.1. Get set. Get ready. Crawl the Web!
      2. 7.2.2. Review of the search prerequisites
      3. 7.2.3. A default set of retrieved and processed news stories
    3. 7.3. Searching for news stories
    4. 7.4. Assigning news categories
      1. 7.4.1. Order matters!
      2. 7.4.2. Classifying with the NewsProcessor class
      3. 7.4.3. Meet the classifier
      4. 7.4.4. Classification strategy: going beyond low-level assignments
    5. 7.5. Building news groups with the NewsProcessor class
      1. 7.5.1. Clustering general news stories
      2. 7.5.2. Clustering news stories within a news category
    6. 7.6. Dynamic content based on the user's ratings
    7. 7.7. Summary
    8. 7.8. To do
    9. 7.9. References
  12. A. Introduction to BeanShell
    1. A.1. What is BeanShell?
    2. A.2. Why use BeanShell?
    3. A.3. Running BeanShell
    4. A.4. References
  13. B. Web crawling
    1. B.1. An overview of crawler components
      1. B.1.1. The stages of crawling
      2. B.1.2. Our simple crawler
      3. B.1.3. Open source web crawlers
    2. B.2. References
  14. C. Mathematical refresher
    1. C.1. Vectors and matrices
    2. C.2. Measuring distances
    3. C.3. Advanced matrix methods
    4. C.4. References
  15. D. Natural language processing
    1. D.1. References
  16. E. Neural networks
    1. E.1. References